Bug 1613781 - scalability issue with external provisioner controller on leader election and api throttling.
Summary: scalability issue with external provisioner controller on leader election and api throttling
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 3.10.z
Assignee: Bradley Childs
QA Contact: Liang Xia
URL:
Whiteboard:
Depends On:
Blocks: 1609360
 
Reported: 2018-08-08 10:23 UTC by Humble Chirammal
Modified: 2023-09-14 04:32 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-20 19:33:54 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1609360 0 unspecified CLOSED [Tracker-OCP-BZ#1613781] scalability issue at external dynamic prov 2023-09-14 04:32:16 UTC

Internal Links: 1609360

Description Humble Chirammal 2018-08-08 10:23:43 UTC
Description of problem:


Scalability issues with the external provisioner controller. When we scale our setup (gluster-block PVCs served by the external provisioner in CNS) to more than 100 PVC requests, we see congestion in serving the requests. The issues are mainly leader election and API throttling, which are known and being addressed in the external-storage repo.

--snip-- from another bug

I requested >100 PVCs via:
for i in $(seq 0 100); do sed s/PVCNAME/block-1-$i/ bc.yaml | oc apply -f- ; done

There's an initial burst of activity in the block provisioner logs and then activity decreases and does not appear to resume. Heketi logs indicate that it is not getting any requests for new block volumes.

The logging from the block provisioner mainly appears to log lines like:
I0728 14:27:56.953340       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-20[c06a4991-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953352       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-33[c2759f71-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953365       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-26[c1525238-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953377       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-43[c4598d59-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953387       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-78[ca3d61dc-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953400       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-64[c8048cb3-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953411       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-84[cb4a2c2a-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953422       1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-90[cc45933f-926f-11e8-b008-5254005f433e]]
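One way to see whether the provisioner is making progress or merely re-queueing the same claims is to count how often each claim appears in scheduleOperation lines like those above. This is an illustrative sketch (not part of the provisioner); the sample lines are abbreviated versions of the log excerpt:

```python
import re
from collections import Counter

# Count scheduleOperation occurrences per PVC in provisioner log lines.
# A claim that shows up many times is being re-queued without ever
# completing provisioning.
PATTERN = re.compile(r"scheduleOperation\[lock-provision-([^\[]+)\[")

def requeue_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        m = PATTERN.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "I0728 14:27:56.953340 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-20[c06a4991-...]]",
    "I0728 14:27:57.120001 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-20[c06a4991-...]]",
]
print(requeue_counts(sample))  # e.g. Counter({'cns/block-1-20': 2})
```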


I did some searching around and found the following links:
  https://github.com/kubernetes-csi/external-provisioner/issues/68

  https://github.com/kubernetes-incubator/external-storage/pull/837

  https://github.com/kubernetes-csi/external-provisioner/pull/104

  https://github.com/kubernetes-incubator/external-storage/pull/825

  https://github.com/kubernetes-incubator/external-storage/commit/126c9ffc8ef125460f6bd75d903ea1d26051b730#diff-634ce7794d379f3ba2119e2413074537


They indicate that there are potential scalability issues with how the external provisioners communicate with the k8s API server. Suspiciously, one of the above links mentions "100 PVCs created at same time", which closely matches how our PVCs were created.

--/snip--
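To illustrate why the API throttling alone can stall a burst of requests: client-go's default rate limiter is a token bucket (by default roughly 5 requests/sec with a burst of 10). The sketch below is not the provisioner's code, just a model of when the Nth API call would be admitted if all calls arrive at once, as in the "100 PVCs at the same time" scenario:

```python
# Minimal token-bucket model of client-go's default client-side rate
# limiting (QPS=5, burst=10). All requests are assumed to arrive at t=0.

def admission_times(n_requests, qps=5.0, burst=10):
    """Return the time (in seconds) at which each request is admitted."""
    times = []
    tokens = float(burst)  # bucket starts full
    now = 0.0
    for _ in range(n_requests):
        if tokens >= 1.0:
            times.append(now)     # admitted immediately
            tokens -= 1.0
        else:
            # wait until the bucket refills enough for one token
            now += (1.0 - tokens) / qps
            tokens = 0.0          # the accrued token is consumed at once
            times.append(now)
    return times

times = admission_times(100)
print(f"request 1 at t={times[0]:.1f}s, request 100 at t={times[-1]:.1f}s")
# prints: request 1 at t=0.0s, request 100 at t=18.0s
```

Eighteen seconds for 100 calls may not sound fatal, but each provisioned PVC needs several API calls (lock endpoint updates, PV creation, claim updates), so the queue compounds quickly and leader-election renewals compete for the same budget.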




Version-Release number of selected component (if applicable):

OCP 3.10

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 1 Humble Chirammal 2018-08-08 10:25:03 UTC
Additional Info: https://bugzilla.redhat.com/show_bug.cgi?id=1609360

Comment 2 Wei Sun 2018-08-09 03:27:55 UTC
Per the version (OCP 3.10) mentioned in the bug, I changed the bug's version from 3.7.1 to 3.10. If that's not correct, feel free to change it back.

Comment 3 Humble Chirammal 2018-08-10 11:27:55 UTC
(In reply to Wei Sun from comment #2)
> Per the version (OCP 3.10) mentioned in the bug, I changed the bug's version
> from 3.7.1 to 3.10. If that's not correct, feel free to change it back.

That's correct. Let's keep it at version 3.10.

Comment 4 Matthew Wong 2018-08-20 11:20:52 UTC
@Humble, please update the provisioner to use lib v5.0.1 and let's see if it helps

Comment 5 Humble Chirammal 2018-08-23 06:13:14 UTC
(In reply to Matthew Wong from comment #4)
> @Humble, please update the provisioner to use lib v5.0.1 and let's see if it
> helps

Sure, let me try the new lib. I will let you know how it goes! Thanks!

Comment 6 Humble Chirammal 2018-08-27 11:17:47 UTC
(In reply to Humble Chirammal from comment #5)
> (In reply to Matthew Wong from comment #4)
> > @Humble, please update the provisioner to use lib v5.0.1 and let's see if it
> > helps
> 
> Sure, let me try the new lib. I will let you know how it goes! Thanks!

Matthew, I built the gluster-block provisioner with lib v5.0.1 and it looks like it's broken on RBAC. More details are in https://bugzilla.redhat.com/show_bug.cgi?id=1609360#c34 and https://bugzilla.redhat.com/show_bug.cgi?id=1609360#c35.

It seems that https://github.com/kubernetes-incubator/external-storage/commit/d46083d75be3c046c5313c03c0b4b5f29d9b2ec2 has broken the provisioners in the repo, or at least the gluster-block provisioner.

We can definitely revert that patch, however I would like to get a second pair of eyes on this.
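For context, and purely as an assumption on my part (not confirmed against that commit): the v5 library moved to a single endpoints-based leader election per controller, so an RBAC failure of this kind is often a provisioner ClusterRole that lacks permissions on the election lock object. A hypothetical fragment, with an illustrative role name, would look roughly like:

```yaml
# Hypothetical RBAC addition; the exact resource and verbs depend on the
# provisioner's actual leader-election configuration.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: glusterblock-provisioner-runner   # name is illustrative
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```

If the linked comments show "forbidden" errors on endpoints, that would point in this direction; if not, a revert of the commit is the safer path.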

Comment 10 Stephen Cuppett 2019-11-20 19:33:54 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

Comment 11 Red Hat Bugzilla 2023-09-14 04:32:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

