Bug 1899575 - update discovery burst to reflect lots of CRDs on openshift clusters
Summary: update discovery burst to reflect lots of CRDs on openshift clusters
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.7.0
Assignee: Maciej Szulik
QA Contact: Mike Fiedler
URL:
Whiteboard:
Duplicates: 1869847 1894574 1902816 1909280 (view as bug list)
Depends On:
Blocks: 1906332 2049157
 
Reported: 2020-11-19 15:27 UTC by Simon Reber
Modified: 2022-03-04 14:48 UTC (History)
CC: 34 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Low limit for client-side throttling. Consequence: Due to the increasing number of CRDs installed in the cluster, requests made during API discovery were limited by the client code. Fix: Increase the limit to twice the current value. Result: Client-side throttling should appear less frequently.
Clone Of:
Clones: 1906332 (view as bug list)
Environment:
Last Closed: 2021-02-24 15:34:22 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift oc pull 645 0 None closed update discovery burst to reflect lots of CRDs on openshift clusters 2021-02-21 16:54:36 UTC
Github openshift oc pull 648 0 None closed Oc rebase to k8s 1.20.0-beta.2 2021-02-21 16:54:37 UTC
Github openshift oc pull 696 0 None closed Bug 1899575: bump discovery burst to 250 2021-02-21 16:54:36 UTC
Red Hat Knowledge Base (Solution) 5587221 0 None None None 2020-11-19 15:56:15 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:35:06 UTC

Internal Links: 1927806

Description Simon Reber 2020-11-19 15:27:37 UTC
Description of problem:

In large OpenShift 4 environments we are seeing messages "Throttling request took 1.183079627s, request: GET: ..." when running a simple "oc get -A pod". This is related to the number of objects and CRDs in the OpenShift 4 cluster.

As increasing `config.Burst` helps, we would like to have this applied in all `oc` versions to prevent the issue from happening.
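The limit in question is client-go's client-side token-bucket rate limiter (a rest.Config defaults to QPS 5 / Burst 10), and API discovery issues one GET per API group/version, so a cluster with many CRD groups drains the bucket quickly. A minimal self-contained sketch of that mechanism (illustrative numbers, not oc's actual code):

```go
package main

import "fmt"

// Toy model of the client-side token-bucket limiter behind the
// "Throttling request took ..." messages. qps/burst mirror the
// client-go rest.Config defaults (QPS 5, Burst 10); the real limiter
// lives in k8s.io/client-go/util/flowcontrol.
type bucket struct {
	qps   float64 // refill rate, tokens per second
	burst float64 // bucket capacity
}

// delayFor returns roughly how long the i-th (0-based) of a run of
// back-to-back requests waits: the first `burst` requests pass
// immediately, each later one waits for the bucket to refill at `qps`.
func (b bucket) delayFor(i int) float64 {
	if float64(i) < b.burst {
		return 0
	}
	return (float64(i) - b.burst + 1) / b.qps
}

func main() {
	b := bucket{qps: 5, burst: 10}
	// Discovery against a cluster with ~100 group/versions fires ~100
	// GETs; the tail of that burst waits many seconds, matching the
	// 1s-18s delays seen in the logs in this bug.
	fmt.Printf("request 5 waits %.1fs\n", b.delayFor(4))
	fmt.Printf("request 100 waits %.1fs\n", b.delayFor(99))
}
```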

Version-Release number of selected component (if applicable):

 - 4.5 and 4.6

How reproducible:

 - Always (depending on the number of CRDs in the OpenShift cluster)

Steps to Reproduce:
1. Add many CRDs to OpenShift 4 and add workloads
2. Run `oc get -A pod`

Actual results:

The `oc` command will run but report `Throttling request took 1.183079627s, request: GET: ...`

Expected results:

The `oc` command should run without client side throttling.

Additional info:

Comment 1 Jan Rautenberg 2020-11-20 16:12:28 UTC
We are facing the same problem. Here is what happens when we delete a pv (via oc and tridentctl):

I1120 15:52:30.591442    1304 request.go:621] Throttling request took 1.183797893s, request: GET:https://172.30.0.1:443/apis/machine.openshift.io/v1beta1?timeout=32s
I1120 15:52:40.790566    1304 request.go:621] Throttling request took 8.796730826s, request: GET:https://172.30.0.1:443/apis/operators.coreos.com/v1alpha1?timeout=32s
I1120 15:52:50.990505    1304 request.go:621] Throttling request took 18.996372382s, request: GET:https://172.30.0.1:443/apis/hco.kubevirt.io/v1alpha1?timeout=32s
I1120 15:52:59.468478    1333 request.go:621] Throttling request took 1.162821518s, request: GET:https://172.30.0.1:443/apis/apiextensions.k8s.io/v1beta1?timeout=32s
I1120 15:53:09.667564    1333 request.go:621] Throttling request took 6.598000689s, request: GET:https://172.30.0.1:443/apis/scheduling.k8s.io/v1beta1?timeout=32s
I1120 15:53:19.866742    1333 request.go:621] Throttling request took 16.796448976s, request: GET:https://172.30.0.1:443/apis/triggers.tekton.dev/v1alpha1?timeout=32s
persistentvolume "pvc-07cfc225-1a7d-41f2-996e-c9ecd37fac4f" deleted

and almost every other oc command is throttled, too. 

Here is a list of the CRDs:

$ oc get crd
I1120 15:55:58.352182   25723 request.go:621] Throttling request took 1.050227415s, request: GET:https://api.dt.ocp.tc.corp:6443/apis/caching.internal.knative.dev/v1alpha1?timeout=32s
NAME                                                             CREATED AT
alertmanagers.monitoring.coreos.com                              2020-07-09T19:16:30Z
apiservers.config.openshift.io                                   2020-07-09T18:58:18Z
authentications.config.openshift.io                              2020-07-09T18:58:19Z
authentications.operator.openshift.io                            2020-07-09T18:58:34Z
baremetalhosts.metal3.io                                         2020-07-09T18:59:05Z
builds.config.openshift.io                                       2020-07-09T18:58:19Z
catalogsources.operators.coreos.com                              2020-07-09T18:58:42Z
cdis.cdi.kubevirt.io                                             2020-10-26T15:09:11Z
certificatedeployments.tollcollect.de                            2020-10-15T12:29:05Z
certmanagers.operator.cert-manager.io                            2020-08-05T13:34:20Z
checlusters.org.eclipse.che                                      2020-10-23T07:48:10Z
cloudcredentials.operator.openshift.io                           2020-09-16T12:15:54Z
clusterautoscalers.autoscaling.openshift.io                      2020-07-09T18:58:33Z
clusternetworks.network.openshift.io                             2020-07-09T19:01:53Z
clusteroperators.config.openshift.io                             2020-07-09T18:58:17Z
clusterresourceoverrides.operator.autoscaling.openshift.io       2020-11-18T14:01:23Z
clusterresourcequotas.quota.openshift.io                         2020-07-09T18:58:18Z
clusterserviceversions.operators.coreos.com                      2020-07-09T18:58:39Z
clustertasks.tekton.dev                                          2020-08-06T09:04:22Z
clustertriggerbindings.triggers.tekton.dev                       2020-08-06T09:05:09Z
clusterversions.config.openshift.io                              2020-07-09T18:58:17Z
conditions.tekton.dev                                            2020-08-06T09:04:22Z
config.operator.tekton.dev                                       2020-08-06T09:03:52Z
configs.imageregistry.operator.openshift.io                      2020-07-09T18:58:30Z
configs.operator.openshift.io                                    2020-07-28T11:04:19Z
configs.samples.operator.openshift.io                            2020-07-09T18:58:31Z
consoleclidownloads.console.openshift.io                         2020-07-09T18:58:31Z
consoleexternalloglinks.console.openshift.io                     2020-07-09T18:58:31Z
consolelinks.console.openshift.io                                2020-07-09T18:58:31Z
consolenotifications.console.openshift.io                        2020-07-09T18:58:30Z
consoles.config.openshift.io                                     2020-07-09T18:58:19Z
consoles.operator.openshift.io                                   2020-07-09T18:58:30Z
consoleyamlsamples.console.openshift.io                          2020-07-09T18:58:31Z
containerruntimeconfigs.machineconfiguration.openshift.io        2020-07-09T19:03:38Z
controllerconfigs.machineconfiguration.openshift.io              2020-07-09T19:03:35Z
credentialsrequests.cloudcredential.openshift.io                 2020-07-09T18:58:22Z
csisnapshotcontrollers.operator.openshift.io                     2020-07-09T18:58:33Z
dmdeployments.dm.<redacted>.de                                  2020-11-09T09:26:37Z
dnses.config.openshift.io                                        2020-07-09T18:58:19Z
dnses.operator.openshift.io                                      2020-07-09T18:58:32Z
dnsrecords.ingress.operator.openshift.io                         2020-07-09T18:58:33Z
egressnetworkpolicies.network.openshift.io                       2020-07-09T19:01:53Z
<redacted>-repositories.cloud.<redacted>.com                                   2020-09-16T11:55:40Z
<redacted>-tenants.cloud.<redacted>.com                                        2020-09-16T11:57:30Z
<redacted>-versions.cloud.<redacted>.com                                       2020-09-16T11:57:56Z
etcds.operator.openshift.io                                      2020-07-09T18:58:33Z
eventlisteners.triggers.tekton.dev                               2020-08-06T09:05:09Z
featuregates.config.openshift.io                                 2020-07-09T18:58:19Z
gitlabs.gitlab.com                                               2020-07-16T13:13:51Z
grafanadashboards.integreatly.org                                2020-08-04T09:36:48Z
grafanadatasources.integreatly.org                               2020-08-04T09:36:48Z
grafanas.integreatly.org                                         2020-08-04T09:36:48Z
helloworlds.mv.<redacted>.com                                      2020-11-04T12:21:40Z
hostpathprovisioners.hostpathprovisioner.kubevirt.io             2020-10-26T15:09:10Z
hostsubnets.network.openshift.io                                 2020-07-09T19:01:53Z
hyperconvergeds.hco.kubevirt.io                                  2020-10-26T15:09:10Z
imagecontentsourcepolicies.operator.openshift.io                 2020-07-09T18:58:20Z
imagemanifestvulns.secscan.quay.redhat.com                       2020-08-24T08:33:05Z
imagepruners.imageregistry.operator.openshift.io                 2020-07-09T18:58:34Z
images.caching.internal.knative.dev                              2020-08-06T09:04:22Z
images.config.openshift.io                                       2020-07-09T18:58:20Z
infrastructures.config.openshift.io                              2020-07-09T18:58:20Z
ingresscontrollers.operator.openshift.io                         2020-07-09T18:58:22Z
ingresses.config.openshift.io                                    2020-07-09T18:58:20Z
installplans.operators.coreos.com                                2020-07-09T18:58:40Z
ippools.whereabouts.cni.cncf.io                                  2020-07-09T19:01:47Z
kafkabridges.kafka.strimzi.io                                    2020-09-14T11:57:40Z
kafkaconnectors.kafka.strimzi.io                                 2020-09-14T11:57:40Z
kafkaconnects.kafka.strimzi.io                                   2020-09-14T11:57:40Z
kafkaconnects2is.kafka.strimzi.io                                2020-09-14T11:57:41Z
kafkamirrormaker2s.kafka.strimzi.io                              2020-09-14T11:57:41Z
kafkamirrormakers.kafka.strimzi.io                               2020-09-14T11:57:41Z
kafkarebalances.kafka.strimzi.io                                 2020-09-14T11:57:41Z
kafkas.kafka.strimzi.io                                          2020-09-14T11:57:41Z
kafkatopics.kafka.strimzi.io                                     2020-09-14T11:57:41Z
kafkausers.kafka.strimzi.io                                      2020-09-14T11:57:41Z
kubeapiservers.operator.openshift.io                             2020-07-09T18:58:40Z
kubecontrollermanagers.operator.openshift.io                     2020-07-09T18:58:31Z
kubedeschedulers.operator.openshift.io                           2020-09-08T14:14:58Z
kubeletconfigs.machineconfiguration.openshift.io                 2020-07-09T19:03:37Z
kubeschedulers.operator.openshift.io                             2020-07-09T18:58:33Z
kubestorageversionmigrators.operator.openshift.io                2020-07-09T18:58:31Z
kubevirtcommontemplatesbundles.ssp.kubevirt.io                   2020-10-26T15:09:10Z
kubevirtmetricsaggregations.ssp.kubevirt.io                      2020-10-26T15:09:10Z
kubevirtnodelabellerbundles.ssp.kubevirt.io                      2020-10-26T15:09:11Z
kubevirts.kubevirt.io                                            2020-10-26T15:09:11Z
kubevirttemplatevalidators.ssp.kubevirt.io                       2020-10-26T15:09:11Z
machineautoscalers.autoscaling.openshift.io                      2020-07-09T18:58:35Z
machineconfigpools.machineconfiguration.openshift.io             2020-07-09T19:03:36Z
machineconfigs.machineconfiguration.openshift.io                 2020-07-09T19:03:34Z
machinehealthchecks.machine.openshift.io                         2020-07-09T18:59:04Z
machines.machine.openshift.io                                    2020-07-09T18:59:04Z
machinesets.machine.openshift.io                                 2020-07-09T18:59:04Z
memcachedajs.cache.aj.t-<redacted>.com                              2020-11-04T12:47:27Z
memcacheds.cache.example.com                                     2020-11-04T08:12:29Z
mvoperators.mv.<redacted>.de                                    2020-11-04T13:18:22Z
mvservices.mv.<redacted>.de                                    2020-09-23T06:21:10Z
netnamespaces.network.openshift.io                               2020-07-09T19:01:53Z
network-attachment-definitions.k8s.cni.cncf.io                   2020-07-09T19:01:47Z
networkaddonsconfigs.networkaddonsoperator.network.kubevirt.io   2020-10-26T15:09:10Z
networks.config.openshift.io                                     2020-07-09T18:58:20Z
networks.operator.openshift.io                                   2020-07-09T18:58:22Z
nodemaintenances.nodemaintenance.kubevirt.io                     2020-10-26T15:09:11Z
nodenetworkconfigurationenactments.nmstate.io                    2020-10-27T14:53:13Z
nodenetworkconfigurationpolicies.nmstate.io                      2020-10-27T14:53:13Z
nodenetworkstates.nmstate.io                                     2020-10-27T14:53:13Z
oauths.config.openshift.io                                       2020-07-09T18:58:21Z
openshiftapiservers.operator.openshift.io                        2020-07-09T18:58:33Z
openshiftartifactoryhas.charts.helm.k8s.io                       2020-07-16T15:40:29Z
openshiftcontrollermanagers.operator.openshift.io                2020-07-09T18:58:34Z
openshiftxrays.charts.helm.k8s.io                                2020-07-30T06:01:35Z
operatorconfigurations.acid.zalan.do                             2020-07-10T12:18:34Z
operatorgroups.operators.coreos.com                              2020-07-09T18:58:49Z
operatorhubs.config.openshift.io                                 2020-07-09T18:58:18Z
operatorpkis.network.operator.openshift.io                       2020-07-09T18:58:46Z
operatorsources.operators.coreos.com                             2020-07-09T18:58:36Z
ovirtproviders.v2v.kubevirt.io                                   2020-10-26T15:09:10Z
pgclusters.crunchydata.com                                       2020-11-09T07:25:21Z
pgpolicies.crunchydata.com                                       2020-11-09T07:25:22Z
pgreplicas.crunchydata.com                                       2020-11-09T07:25:22Z
pgtasks.crunchydata.com                                          2020-11-09T07:25:22Z
pipelineresources.tekton.dev                                     2020-08-06T09:04:22Z
pipelineruns.tekton.dev                                          2020-08-06T09:04:22Z
pipelines.tekton.dev                                             2020-08-06T09:04:22Z
podmonitors.monitoring.coreos.com                                2020-07-09T19:16:30Z
postgresqls.acid.zalan.do                                        2020-07-10T12:18:34Z
profiles.tuned.openshift.io                                      2020-07-09T18:58:35Z
projects.config.openshift.io                                     2020-07-09T18:58:21Z
prometheuses.monitoring.coreos.com                               2020-07-09T19:16:30Z
prometheusrules.monitoring.coreos.com                            2020-07-09T19:16:30Z
provisionings.<redacted>.io                                          2020-07-09T18:59:04Z
proxies.config.openshift.io                                      2020-07-09T18:58:18Z
rolebindingrestrictions.authorization.openshift.io               2020-07-09T18:58:17Z
schedulers.config.openshift.io                                   2020-07-09T18:58:21Z
securitycontextconstraints.security.openshift.io                 2020-07-09T18:58:18Z
servicecas.operator.openshift.io                                 2020-07-09T18:58:35Z
servicecatalogapiservers.operator.openshift.io                   2020-07-09T18:58:33Z
servicecatalogcontrollermanagers.operator.openshift.io           2020-07-09T18:58:33Z
servicemonitors.monitoring.coreos.com                            2020-07-09T19:16:30Z
shes.dt-av-sh.<redacted>.de                                     2020-10-08T10:55:26Z
shes.et-av-sh.<redacted>.de                                     2020-10-16T07:13:15Z
storagestates.migration.k8s.io                                   2020-07-09T18:58:36Z
storageversionmigrations.migration.k8s.io                        2020-07-09T18:58:34Z
subscriptions.operators.coreos.com                               2020-07-09T18:58:41Z
taskruns.tekton.dev                                              2020-08-06T09:04:22Z
tasks.tekton.dev                                                 2020-08-06T09:04:22Z
thanosrulers.monitoring.coreos.com                               2020-07-28T11:15:20Z
tridentbackends.trident.netapp.io                                2020-07-09T21:00:00Z
tridentnodes.trident.netapp.io                                   2020-07-09T21:00:00Z
tridentsnapshots.trident.netapp.io                               2020-07-09T21:00:01Z
tridentstorageclasses.trident.netapp.io                          2020-07-09T21:00:01Z
tridenttransactions.trident.netapp.io                            2020-07-09T21:00:01Z
tridentversions.trident.netapp.io                                2020-07-09T21:00:00Z
tridentvolumes.trident.netapp.io                                 2020-07-09T21:00:00Z
triggerbindings.triggers.tekton.dev                              2020-08-06T09:05:09Z
triggertemplates.triggers.tekton.dev                             2020-08-06T09:05:09Z
tuneds.tuned.openshift.io                                        2020-07-09T18:58:33Z
v2vvmwares.v2v.kubevirt.io                                       2020-10-26T15:09:10Z
vmimportconfigs.v2v.kubevirt.io                                  2020-10-26T15:09:11Z
volumesnapshotclasses.snapshot.storage.k8s.io                    2020-07-09T19:40:09Z
volumesnapshotcontents.snapshot.storage.k8s.io                   2020-07-09T19:40:08Z
volumesnapshots.snapshot.storage.k8s.io                          2020-07-09T19:40:07Z
zdmvoperators.zd.zd-mv.<redacted>.de                            2020-11-05T09:08:10Z

Cluster utilization seems okay:
CPU: 90-120 / 248
RAM: 900-1200 GB / 2550 GB
Network: ~100 MBps idle / ~500 MBps peak
Pods: 2.5-3.2k

Setup is 3 Master / 2 Infra/ 12 Compute nodes in total.

Any suggestion is welcome.

Comment 2 Maciej Szulik 2020-11-20 21:04:09 UTC
This was moved to https://github.com/kubernetes/kubernetes/pull/96763 and will be picked after next k8s bump.

Comment 3 Maciej Szulik 2020-11-23 14:07:38 UTC
This will be bumped in https://github.com/openshift/oc/pull/648

Comment 5 Maciej Szulik 2020-11-26 14:03:34 UTC
*** Bug 1894574 has been marked as a duplicate of this bug. ***

Comment 6 Maciej Szulik 2020-12-01 15:09:31 UTC
*** Bug 1902816 has been marked as a duplicate of this bug. ***

Comment 7 Maciej Szulik 2020-12-04 16:28:19 UTC
This merged in https://github.com/openshift/oc/pull/660

Comment 9 Mike Fiedler 2020-12-09 13:54:07 UTC
@sreber  Can you estimate total # of CRDs in the cluster when you hit this?   Looking for a reproducer to verify the fix.
You can run this Python2 script to  count them:  https://github.com/openshift/svt/blob/master/openshift_tooling/list_all_resources/list_all.py

e.g. python2 list_all.py -c -s all -o count | tee list.out 

It will take a while as it iterates all CRD types.

Comment 11 Mike Fiedler 2020-12-10 20:23:15 UTC
@soltysh Tested this with:

Client Version: 4.7.0-0.nightly-2020-12-09-112139
Server Version: 4.7.0-0.nightly-2020-12-09-112139

It is better than 4.5 but with enough groups/CRDs it starts happening again.

4.5:  create 100 CRDs in 100 groups, oc get on any of the CRDs gets throttling messages
4.7:  create 100 CRDs in 100 groups, oc get succeeds, no throttling
4.7:  create 200 CRDs in 200 groups, oc get on any of the CRDs gets throttling messages

Do we want to call this good enough for 4.7?   Raising the point where we hit this?  If so,  you can move this back to ON_QA
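The "N CRDs in N groups" reproducer above can be scripted; this is a rough sketch, and the group/kind names are invented for the test:

```go
package main

import (
	"fmt"
	"os"
)

// manifest returns a minimal apiextensions.k8s.io/v1 CRD in its own API
// group, so each applied CRD adds a distinct group/version endpoint to
// the discovery pass.
func manifest(i int) string {
	return fmt.Sprintf(`---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.group%[1]d.throttle-test.example.com
spec:
  group: group%[1]d.throttle-test.example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget%[1]d
    listKind: Widget%[1]dList
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
`, i)
}

func main() {
	// Emit 200 CRDs, one per group.
	for i := 0; i < 200; i++ {
		fmt.Fprint(os.Stdout, manifest(i))
	}
}
```

Piping the output to `oc apply -f -` registers 200 new API groups; a subsequent `oc get` of any of the new kinds then forces a full discovery pass.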

Comment 12 Maciej Szulik 2020-12-11 12:22:47 UTC
(In reply to Mike Fiedler from comment #11)
> @soltysh Tested this with:
> 
> Client Version: 4.7.0-0.nightly-2020-12-09-112139
> Server Version: 4.7.0-0.nightly-2020-12-09-112139
> 
> It is better than 4.5 but with enough groups/CRDs it starts happening again.
> 
> 4.5:  create 100 CRDs in 100 groups, oc get on any of the CRDs gets
> throttling messages
> 4.7:  create 100 CRDs in 100 groups, oc get succeeds, no throttling
> 4.7:  create 200 CRDs in 200 groups, oc get on any of the CRDs gets
> throttling messages
> 
> Do we want to call this good enough for 4.7?   Raising the point where we
> hit this?  If so,  you can move this back to ON_QA

The default cluster comes with ~100 CRDs, and getting throttling at that was a problem.
If you're saying that we're hitting it at 200, I think that gives us about 100 CRDs of
room, which is sufficient at least for the time being. If we notice these numbers
are not sufficient we can definitely bump them higher. So I'm moving this back to QA.

Comment 17 Jens Reimann 2020-12-18 08:40:12 UTC
As this is not only annoying, but also limits throughput of requests (because throttling actually happens), what is the workaround for 4.6?

It took me a while to finally end up in this Bugzilla issue. It would have really helped me if this was listed in the "Known Issues" section -> https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-known-issues

Comment 20 Mike Fiedler 2020-12-18 20:39:47 UTC
Marking verified based on comment 12 and comment 14.  Verified on 4.7.0-0.nightly-2020-12-09-112139
Follow https://bugzilla.redhat.com/show_bug.cgi?id=1906332 for 4.6.z backport

Comment 21 Maciej Szulik 2021-01-05 12:08:55 UTC
*** Bug 1909280 has been marked as a duplicate of this bug. ***

Comment 22 milti leonard 2021-01-06 19:00:52 UTC
Maciej, my cu came back with a must-gather over the holidays: it's available from the cu ticket [1] here: https://attachments.access.redhat.com/hydra/rest/cases/02774693/attachments/9e0abd3f-182b-46c5-b47b-ee5f61a44933?usePresignedUrl=true



[1] https://url.corp.redhat.com/486a9c4

Comment 23 Chet Hosey 2021-01-11 17:30:22 UTC
We're at 173 CRDs on a test cluster that has been upgraded from 4.1 through to 4.6, and uses OCS, and a few items from Operator Hub (virtualization, JBoss operator, service mesh, pipelines, metering, etc.). CLI operations are noticeably slower than when the cluster was still on 4.4, and they're slower than on our production cluster (which is still at 4.4).

200 seems like a pretty low threshold once you start enabling components that are deployed via Operator Hub.

Comment 24 Jens Reimann 2021-01-12 09:06:33 UTC
Same here. Having a 4.1 cluster upgraded all the way to 4.6, and having things like Service Mesh, Serverless, SSO, Pipelines, etc installed via OperatorHub I am already at "221" CRDs. Assuming that people are expected to install more services/components via Operators you should probably target a much higher number.

Comment 25 Maciej Szulik 2021-01-12 11:53:20 UTC
After reconsidering this issue I'm going to bump this slightly, to 250.

Comment 26 Maciej Szulik 2021-01-13 11:39:05 UTC
The last bump to 250 merged in https://github.com/openshift/oc/pull/696

Comment 28 milti leonard 2021-01-14 16:45:13 UTC
hello,

IHAC (tkt#02774693) on 4.6 who started deleting CRDs in his cluster where this was being seen, and noted that the threshold for throttling messages apparently is 158 CRDs in his cluster: at 159 CRDs, the throttling messages started again in his output. Is this something that's configurable by the user, or can this threshold only be set at install, or is it hard-coded?

thnx, m

Comment 29 Mike Fiedler 2021-01-18 19:44:35 UTC
Verified on 4.7.0-0.nightly-2021-01-18-144603

Created 200 CRDs on top of OOTB CRDs and saw no throttling messages.
Created 250 CRDs on top of OOTB CRDs and saw very rare throttling messages - much less often than when creating 200 CRDs with the limit set to 200.

Comment 30 Maciej Szulik 2021-01-21 11:54:10 UTC
*** Bug 1869847 has been marked as a duplicate of this bug. ***

Comment 35 Ann Hayes 2021-01-27 10:47:04 UTC
We are running on 4.6.12 and have 253 CRDs, and we are continually seeing this throttling issue.
Can we increase the limit over 250, and will this be pulled back into the 4.6.x release?

Comment 36 Maciej Szulik 2021-01-28 09:29:43 UTC
(In reply to Ann Hayes from comment #35)
> we are running on 4.6.12 and have 253 CRDS and we are continually seeing
> this throttling issue.
> Can we increase the limit over 250 and will this be pulled back into 4.6.x
> release.

Yes, this was bumped to 250 and cherry-picked into 4.6 (https://github.com/openshift/oc/pull/716);
it should be present in the upcoming .z stream release.

Comment 37 Chet Hosey 2021-01-28 09:44:17 UTC
Why isn’t this user-configurable?

Better yet, why is there an arbitrary threshold beyond which users get annoying warnings and delays?

How does this enhance the user experience? Is there any benefit at all to this behavior, and where is it explained?

Comment 38 Maciej Szulik 2021-01-28 10:01:12 UTC
(In reply to Chet Hosey from comment #37)
> Why isn’t this user-configurable?

It is not expected to be user-configurable since this is one of several layers of
protection against exhausting the server with requests. Specifically, retrieving
complete discovery information requires reaching many endpoints and thus is limited.
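For intuition, the discovery budget is spent on all API group/versions, built-ins included, not just CRDs - which is why throttling can start at a CRD count noticeably below the burst value. A rough back-of-the-envelope model (the built-in group/version count here is an assumption and varies by cluster and version):

```go
package main

import "fmt"

// discoveryRequests estimates the GETs issued by a full discovery pass:
// one per API group/version, plus the initial /api and /apis listings.
func discoveryRequests(builtinGroupVersions, crdGroupVersions int) int {
	return builtinGroupVersions + crdGroupVersions + 2
}

// throttled reports whether a back-to-back run of that many requests
// exceeds the client-side burst budget.
func throttled(requests, burst int) bool {
	return requests > burst
}

func main() {
	// Assuming ~90 built-in group/versions, 159 CRD groups already
	// pushes a burst budget of 250 over the line.
	reqs := discoveryRequests(90, 159)
	fmt.Println(reqs, throttled(reqs, 250))
}
```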

Comment 39 Ann Hayes 2021-01-28 10:08:53 UTC
(In reply to Maciej Szulik from comment #36)
> (In reply to Ann Hayes from comment #35)
> > we are running on 4.6.12 and have 253 CRDS and we are continually seeing
> > this throttling issue.
> > Can we increase the limit over 250 and will this be pulled back into 4.6.x
> > release.
> 
> Yes, this was bumped to 250 and a cherry-picked into 4.6
> (https://github.com/openshift/oc/pull/716)
> should be present in the upcoming .z stream release.

Sorry, what is the .z stream -- will it be in stable-4.6.x so we can apply it on AMD64?

Why 250 when we have problems at 253? Shouldn't the number be much higher?

Comment 40 Maciej Szulik 2021-01-28 10:15:41 UTC
> Sorry, what is the .z stream -- will it be in the stable-4.6.x so we can
> apply on AMD64

yeah, the next stable release should contain it. 

> Why 250 when we have problems at 253? shouldn't the number be much higher

That was a safe middle-ground number according to our tests. The actual
numbers will differ from installation to installation but we didn't want to
go to extremes with this setting.

Comment 41 Ann Hayes 2021-01-28 13:19:58 UTC
(In reply to Maciej Szulik from comment #40)
> > Sorry, what is the .z stream -- will it be in the stable-4.6.x so we can
> > apply on AMD64
> 
> yeah, the next stable release should contain it. 
> 
> > Why 250 when we have problems at 253? shouldn't the number be much higher
> 
> That was a safe middle-ground number according to our tests. The actual
> numbers will differ from installation to installation but we didn't want to
> go to extremes with this setting.

Thanks for your reply. What is the next target release - 4.6.13? Can you please confirm?

Comment 42 Maciej Szulik 2021-01-28 13:34:09 UTC
> Thanks for your reply, what is that next target release - 4.6.13? Can you
> please confirm

Yeah, I think that should be it.

Comment 43 Chet Hosey 2021-01-28 14:17:06 UTC
As a user who sees `oc` invocations stalled seemingly at random, I wonder if the cure is worse than the disease.

Right now it still isn't clear how this behavior is supposed to be helpful.

Is there documentation that would clarify?

Comment 44 Robert Baumgartner 2021-02-03 11:40:04 UTC
(In reply to Maciej Szulik from comment #38)
> It is not expected to be user-configurable since this is one of the several layers of 
> preventing the server from being exhausted with requests. Specifically retrieving
> complete discovery information requires reaching to many endpoints and thus is limited.

If this is not user-configurable it should be "auto-tuned", depending on the actual number of CRDs.
I am running 4.7.0-fc.5 with fewer than 200 CRDs (after removing Service Mesh - 31 CRDs!) and still get the error:

$ oc version
Client Version: 4.6.8
Server Version: 4.7.0-fc.5
Kubernetes Version: v1.20.0+3b90e69
$ oc get crds|wc -l
    187  
$ oc get all -n demo
I0203 12:28:28.294458  987490 request.go:645] Throttling request took 1.157750898s, request: GET:https://api.ocp4.openshift.freeddns.org:6443/apis/node.k8s.io/v1beta1?timeout=32s
NAME                                 READY   STATUS    RESTARTS   AGE
pod/nodejs-sample-67f458cd89-4wg9w   1/1     Running   0          18h
...

Comment 48 Jens Reimann 2021-02-11 16:00:29 UTC
If I understood the problem correctly, it is a client side issue.

I had my cluster updated to 4.6.16 and still had the issue. After updating the "oc" client to 4.6.16, the issue was gone.

I guess this actually is an issue in the upstream Go client for Kubernetes, as I saw the same problem in the Keycloak operator, which started to throttle as well. It might also be that "kubectl" has the same limitation.

Comment 51 Chet Hosey 2021-02-24 01:12:29 UTC
Is it documented anywhere why “randomly slow down the client when too many Red-Hat-supported features are enabled” is a feature in the first place?

It would really be nice to have an explanation around who benefits from this behavior. Why does this exist at all?

Comment 52 errata-xmlrpc 2021-02-24 15:34:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 53 milti leonard 2021-02-24 16:07:56 UTC
@maciej, will this fix be backported to 4.6?

Comment 60 morningspace 2021-07-06 15:30:20 UTC
I happened to come across this bug and I'd like to know whether the throttling will also impact application deployment on OCP. Our env has 400+ CRDs and I always see the throttling message when I run every single oc command! Worse, one of our applications, which includes tons of operators, has failed to deploy to this env for quite a long time: some pods were not successfully up and running. I suspect this is also caused by the throttling, because I assume this issue would also happen in many Go operators' code, where they use the Go client for Kubernetes and could fail.

Comment 61 Jens Reimann 2021-07-07 06:36:18 UTC
(In reply to morningspace from comment #60)
> Happen to come across here and I'd like to know whether the throttling will
> also impact the application deployment on OCP. Our env have 400+ CRDs and I
> can always see the throttling message when I run every single oc command!
> More worse is that one our application which includes tons of operators has
> failed to be deployed to this env for quite long time. Some pods were not
> successfully up and running. I'm suspecting this is also caused by the
> throttling, because I'm supposing this also issue would also happen in many
> Go operators' code where they may use Go client for Kubernete and could fail.

This is indeed the case. I have already seen a few solutions built on top of the Go-based Kubernetes client throttle as well: ArgoCD, the Keycloak operator, but also tools like the Helm command line tool.

Tools not built on top of this library don't suffer from this behavior. So maybe using Rust or Java is an alternative :D

I also noticed that the message that gets written out has improved a bit. It now specifically mentions "client-side" throttling, which is an improvement, I guess.

Comment 62 morningspace 2021-07-07 08:52:54 UTC
Alright. So with that, I guess just upgrading OCP to a newer 4.6.x version (such as 4.6.37, which appears to be the latest 4.6 release at the moment) would not resolve the issue, because the issue comes from the client side. The fix went into oc, so it can resolve the issue for the command line; however, the Kubernetes Go client is also referenced inside each Go operator's code, so the issue will still happen even if both OCP and oc are upgraded, unless we modify the Go operator code.
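For operator or tool authors hitting the same limiter, the usual remedy is raising QPS/Burst on the rest.Config before building clients. The sketch below uses a local stand-in struct mirroring the relevant client-go fields (client-go itself is not vendored here), and the 50/250 values are assumptions modeled on what oc ended up with, not documented defaults:

```go
package main

import "fmt"

// restConfig is a hypothetical stand-in mirroring the QPS and Burst
// fields of k8s.io/client-go's rest.Config.
type restConfig struct {
	QPS   float32
	Burst int
}

// withRaisedDiscoveryLimits raises the client-side rate limits so that
// discovery against a CRD-heavy cluster does not trip the limiter;
// it never lowers limits that are already higher.
func withRaisedDiscoveryLimits(cfg *restConfig) *restConfig {
	if cfg.QPS < 50 {
		cfg.QPS = 50
	}
	if cfg.Burst < 250 {
		cfg.Burst = 250
	}
	return cfg
}

func main() {
	cfg := withRaisedDiscoveryLimits(&restConfig{QPS: 5, Burst: 10})
	fmt.Println(cfg.QPS, cfg.Burst)
}
```

With the real client-go, the same assignments would be made on the rest.Config returned by the kubeconfig loader before passing it to the client constructors.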

Comment 63 damien 2021-08-23 11:40:40 UTC
Don't forget to update your oc binary. I was having the issue with oc 4.5.0 and a cluster on 4.7.22 with a few additional CRDs (Argo, Jaeger, IPAM, etc.).

I've updated oc to 4.7.0-202107261701.p0.git.8b4b094.assembly.stream-8b4b094 and the problem disappeared.

