Description of problem:

Scalability issues with the external provisioner controller. When we scale our setup (gluster block PVCs served by the external provisioner in CNS) to >100 PVC requests, we see congestion in serving the PVC requests. The issues are mainly in leader election and API throttling, which are known and being addressed in the external-storage repo.

--snip-- from another bug

I requested >100 PVCs via:

for i in $(seq 0 100); do sed s/PVCNAME/block-1-$i/ bc.yaml | oc apply -f- ; done

There's an initial burst of activity in the block provisioner logs, then activity decreases and does not appear to resume. Heketi logs indicate that it is not getting any requests for new block volumes. The block provisioner mainly logs lines like:

I0728 14:27:56.953340 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-20[c06a4991-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953352 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-33[c2759f71-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953365 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-26[c1525238-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953377 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-43[c4598d59-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953387 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-78[ca3d61dc-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953400 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-64[c8048cb3-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953411 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-84[cb4a2c2a-926f-11e8-b008-5254005f433e]]
I0728 14:27:56.953422 1 controller.go:1167] scheduleOperation[lock-provision-cns/block-1-90[cc45933f-926f-11e8-b008-5254005f433e]]

I did some searching around and found the following links:

https://github.com/kubernetes-csi/external-provisioner/issues/68
https://github.com/kubernetes-incubator/external-storage/pull/837
https://github.com/kubernetes-csi/external-provisioner/pull/104
https://github.com/kubernetes-incubator/external-storage/pull/825
https://github.com/kubernetes-incubator/external-storage/commit/126c9ffc8ef125460f6bd75d903ea1d26051b730#diff-634ce7794d379f3ba2119e2413074537

They indicate that there are potential scalability issues in how the external provisioners communicate with the k8s API server. Suspiciously, one of the above links mentions "100 PVCs created at same time", which is quite similar to how our PVCs were created.

--/snip--

Version-Release number of selected component (if applicable):
OCP 3.10

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
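For completeness, a minimal bc.yaml compatible with the reproduction loop above could look like the sketch below. The actual template was not attached to this bug, so the storage class name and requested size are assumptions; the loop here renders the manifests to stdout instead of piping to `oc apply`, so it runs without a cluster.

```shell
# Hypothetical bc.yaml template (the real one was not attached to the bug);
# PVCNAME is the placeholder substituted by the reproduction loop.
cat > bc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: PVCNAME
spec:
  storageClassName: glusterfs-block   # assumed storage class name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi                    # assumed size
EOF

# Same substitution loop as in the report, but printing the rendered
# manifests instead of applying them to a cluster.
for i in $(seq 0 100); do sed s/PVCNAME/block-1-$i/ bc.yaml; done
```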
Additional Info: https://bugzilla.redhat.com/show_bug.cgi?id=1609360
Per the version (OCP 3.10) mentioned in the bug, I changed the bug's version from 3.7.1 to 3.10. If that's not correct, feel free to change it back.
(In reply to Wei Sun from comment #2) > Per the version(OCP 3.10) mentioned in bug,change the bug's version from > 3.7.1 to 3.10,if it's not correct,feel free to change back. That's correct. Let's keep it at version 3.10.
@Humble, please update the provisioner to use lib v5.0.1 and let's see if it helps
(In reply to Matthew Wong from comment #4) > @Humble, please update the provisioner to use lib v5.0.1 and let's see if it > helps Sure, let me try the new lib. I will let you know how it goes! Thanks!
(In reply to Humble Chirammal from comment #5) > (In reply to Matthew Wong from comment #4) > > @Humble, please update the provisioner to use lib v5.0.1 and let's see if it > > helps > > Sure, let me try the new lib. I will let you know how it goes! Thanks! Matthew, I built the gluster block provisioner with lib v5.0.1 and it looks like it's broken on RBAC. More details are in https://bugzilla.redhat.com/show_bug.cgi?id=1609360#c34 and https://bugzilla.redhat.com/show_bug.cgi?id=1609360#c35. It seems that https://github.com/kubernetes-incubator/external-storage/commit/d46083d75be3c046c5313c03c0b4b5f29d9b2ec2 has broken the provisioners in the repo, or at least the gluster block provisioner. We can definitely revert that patch; however, I would like to get a second pair of eyes on this.
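For context, a common RBAC failure mode with the v5.x provisioner library is the leader-election code needing to manage Endpoints objects that the provisioner's role does not permit. This is an assumption here (the exact errors are in the linked comments #c34/#c35); if that is what the logs show, a rule along these lines in the provisioner's ClusterRole would be the candidate fix. A sketch, not the verified resolution:

```yaml
# Hypothetical addition to the gluster-block provisioner's ClusterRole,
# assuming the failure is endpoints-based leader election being denied.
# Verify against the actual RBAC errors in the linked comments first.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: glusterblock-provisioner-runner   # assumed role name
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
```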
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed. [1]: https://access.redhat.com/support/policy/updates/openshift
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days