Bug 1839933

Summary: Pods with PVCs attached takes long time to start
Product: OpenShift Container Platform Reporter: Humble Chirammal <hchiramm>
Component: Storage Assignee: Jan Safranek <jsafrane>
Storage sub component: Kubernetes QA Contact: Wei Duan <wduan>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, bniver, chaoyang, ebenahar, ekuric, gmeno, jsafrane, kramdoss, madam, mrajanna, muagarwa, ocs-bugs, ratamir, rperiyas, sostapov, ykaul
Version: 4.4 Keywords: Performance, Regression
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1836198
: 1854311 Environment:
Last Closed: 2020-10-27 16:01:02 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1836198, 1854311, 1870193    

Comment 1 Humble Chirammal 2020-05-26 04:23:15 UTC
Just to summarize:

Here is the test matrix with results, where OK means the slowness is NOT present:

> OCP v4.3 + OCS v4.4 - OK 
> OCP v4.3 + OCS v4.3 - OK
> 
> OCP v4.5 + OCS v4.4 - not OK
> OCP v4.4 + OCS v4.4 - not OK
> 
> so once we switch to OCP v4.4, the issue with slow pod starts is present.


The parent bug also mentions (https://bugzilla.redhat.com/show_bug.cgi?id=1836198#c39) that this is an issue for the gp2 storage class as well. In a way, that makes an issue in the CSI driver or the CSI layer less likely.

The possible issue and fixes are pointed out below:

--snip--

> Does OCS use external-attacher? You may be hitting this:
> https://github.com/kubernetes/kubernetes/issues/84169.
> Check this comment for workarounds:
> https://github.com/kubernetes/kubernetes/issues/84169#issuecomment-545692734
> - does this help?
> There is also PR in progress:
> https://github.com/kubernetes/kubernetes/pull/91307

--/snip--

Comment 2 Jan Safranek 2020-05-26 07:48:51 UTC
Waiting for upstream PR: https://github.com/kubernetes/kubernetes/pull/91307

Comment 5 Humble Chirammal 2020-05-27 04:58:10 UTC
(In reply to Jan Safranek from comment #4)
> And of course, once the PR is available, we're going to backport it all
> the way to 4.4

Thanks Jan for the update. The upstream PR is in good shape apart from the missing tests.

Hopefully it will get there soon. 

Apart from that, I was thinking about how to solve this, or how to avoid getting into this situation in the first place, and this is what I could come up with.

The Ceph CSI driver does not make use of the CONTROLLER PUBLISH and UNPUBLISH calls; these capabilities are not exposed by the driver. However, given the history of the development of CSI and the Ceph CSI plugin, we ***were*** making use of the external-attacher sidecar until now. Upstream now has a way to `skip attach` as part of the CSIDriver object. Considering that the CSIDriver functionality is GA with Kubernetes 1.18, my proposal (https://github.com/ceph/ceph-csi/issues/1106) is to get rid of external-attacher completely in the Ceph CSI driver implementation by making use of this field. However, this needs extensive testing in the driver, which we will take up with upstream release v3.0.0.
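
For illustration only, here is a rough sketch of what a CSIDriver object with attach skipped could look like. It assumes the `skip attach` behaviour maps to `spec.attachRequired: false` in the `storage.k8s.io/v1` CSIDriver API. The driver name `rbd.csi.ceph.com` and the helper `csidriver_manifest` are placeholders chosen for this example, not something taken from this bug or from the actual Ceph CSI deployment manifests.

```python
import json


def csidriver_manifest(driver_name: str) -> dict:
    """Build a CSIDriver manifest that tells Kubernetes to skip the attach step.

    With attachRequired set to False, no VolumeAttachment object is expected
    for this driver, so an external-attacher sidecar is not needed.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "CSIDriver",
        "metadata": {"name": driver_name},
        "spec": {"attachRequired": False},
    }


if __name__ == "__main__":
    # The driver name below is an assumption used only for this example.
    print(json.dumps(csidriver_manifest("rbd.csi.ceph.com"), indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, since the API server accepts JSON as well as YAML manifests. Whether the driver behaves correctly without the attacher is exactly the part that still needs testing.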

Comment 11 Humble Chirammal 2020-06-30 12:18:06 UTC
Test result - copied from bz#1836198

--snip--

I tested with 

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.6.0-0.nightly/release/4.6.0-0.nightly-2020-06-30-020342 

and OCS v4.4 (quay.io/rhceph-dev/ocs-olm-operator:latest-stable-4.4), and I see an improvement compared to the state from comment #1 of this BZ.

With OCP v4.6, pods start at the same speed as they did with OCP v4.3.

From the beginning, OCS was not the problematic side, since OCP v4.4 / OCP v4.5 combined with OCS v4.4 / OCS v4.5 / OCS v4.3 were all problematic.


Now, with OCP v4.6 + OCS v4.4, the result is satisfactory: pods are starting fine.

Start times: 

first batch of 1000 pods with PVCs: 10m 17s
second batch of 1000 pods with PVCs: 10m 27s
third batch of 1000 pods with PVCs: 10m 25s

In this test, 500 pods with PVCs were scheduled per OCP node.
--/snip--
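
To put the three batch times above in perspective, here is a small sketch that converts them to seconds and derives the mean batch time and the pod start rate. The helper names (`to_seconds`, `summarize`) are made up for this example; the only inputs are the three durations and the batch size of 1000 pods quoted in the test result.

```python
import re


def to_seconds(text: str) -> int:
    """Parse a duration such as '10m 17s' or '10m 17sec' into seconds."""
    match = re.fullmatch(r"\s*(\d+)m\s*(\d+)s(?:ec)?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


def summarize(batches: list[str], pods_per_batch: int) -> dict:
    """Return per-batch seconds, the mean batch time and pods started per second."""
    seconds = [to_seconds(b) for b in batches]
    return {
        "seconds": seconds,
        "mean_seconds": sum(seconds) / len(seconds),
        "pods_per_second": [pods_per_batch / s for s in seconds],
    }


# Durations as reported for the three batches of 1000 pods with PVCs.
BATCHES = ["10m 17s", "10m 27s", "10m 25s"]

if __name__ == "__main__":
    result = summarize(BATCHES, pods_per_batch=1000)
    print(result["seconds"])       # [617, 627, 625]
    print(result["mean_seconds"])  # 623.0
    print([round(r, 2) for r in result["pods_per_second"]])
```

So each batch of 1000 pods takes a little over 10 minutes (623 s on average), which works out to roughly 1.6 pods started per second, and the three batches are within 10 seconds of each other.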

Comment 13 errata-xmlrpc 2020-10-27 16:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196