Bug 1839933 - Pods with PVCs attached takes long time to start
Summary: Pods with PVCs attached takes long time to start
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jan Safranek
QA Contact: Wei Duan
URL:
Whiteboard:
Depends On:
Blocks: 1836198 1854311 1870193
Reported: 2020-05-26 04:17 UTC by Humble Chirammal
Modified: 2021-01-26 15:59 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1836198
Clones: 1854311
Environment:
Last Closed: 2020-10-27 16:01:02 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25109 | 0 | None | closed | Bug 1839933: UPSTREAM: 91307: CSI: Modify VolumeAttachment check to use Informer/Cache | 2021-01-25 16:24:05 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:01:27 UTC

Internal Links: 1836198

Comment 1 Humble Chirammal 2020-05-26 04:23:15 UTC
Just to summarize:

Here is the test matrix with results, where "OK" means the slowness is NOT present:

> OCP v4.3 + OCS v4.4 - OK 
> OCP v4.3 + OCS v4.3 - OK
> 
> OCP v4.5 + OCS v4.4 - not OK
> OCP v4.4 + OCS v4.4 - not OK
> 
> So once we switch to OCP v4.4, the issue with slow pod starts is present.


The parent bug also mentions (https://bugzilla.redhat.com/show_bug.cgi?id=1836198#c39) that this issue occurs with the gp2 storage class as well. In a way, this reduces the likelihood that the issue lies in the CSI driver or the CSI layer.

The possible causes and fixes are noted below:

--snip--

> Does OCS use external-attacher? You may be hitting this:
> https://github.com/kubernetes/kubernetes/issues/84169.
> Check this comment for workarounds:
> https://github.com/kubernetes/kubernetes/issues/84169#issuecomment-545692734
> - does this help?
> There is also a PR in progress:
> https://github.com/kubernetes/kubernetes/pull/91307

--/snip--
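
As a rough illustration of the idea in that PR's title (checking VolumeAttachment objects through an informer/cache instead of reading each one from the API server), here is a toy simulation. It is not kubelet or Kubernetes code: the class and function names are hypothetical, and the assumption that per-check API reads are what slows pod start is mine, taken from the linked discussion rather than measured here.

```python
# Toy simulation only; all names are hypothetical and nothing here is Kubernetes code.


class FakeApiServer:
    """Stands in for the API server and counts the read requests it receives."""

    def __init__(self, attachments):
        self._attachments = dict(attachments)
        self.get_calls = 0
        self.list_calls = 0

    def get(self, name):
        self.get_calls += 1
        return self._attachments.get(name)

    def list_all(self):
        self.list_calls += 1
        return dict(self._attachments)


class AttachmentCache:
    """Local copy filled by one list request.

    A real informer would also watch for changes; that part is omitted here.
    """

    def __init__(self, api):
        self._store = api.list_all()

    def get(self, name):
        return self._store.get(name)


def is_attached(source, name):
    """Return True if the named attachment exists and is marked attached."""
    obj = source.get(name)
    return bool(obj and obj["attached"])


names = [f"va-{i}" for i in range(1000)]
api = FakeApiServer({n: {"attached": True} for n in names})

# Direct reads: one API request per volume check.
direct_results = [is_attached(api, n) for n in names]
direct_calls = api.get_calls

# Cached reads: one list request up front, then purely local lookups.
cache = AttachmentCache(api)
cached_results = [is_attached(cache, n) for n in names]
extra_get_calls = api.get_calls - direct_calls

print(direct_calls, api.list_calls, extra_get_calls)
```

With 1000 volumes the direct path issues 1000 read requests, while the cached path issues a single list request and no further reads. If client requests are rate limited, that difference would add up when many pods with PVCs start at once.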

Comment 2 Jan Safranek 2020-05-26 07:48:51 UTC
Waiting for upstream PR: https://github.com/kubernetes/kubernetes/pull/91307

Comment 5 Humble Chirammal 2020-05-27 04:58:10 UTC
(In reply to Jan Safranek from comment #4)
> And of course, once the PR is available, we're going to backport it to all
> the way to 4.4

Thanks, Jan, for the update. The upstream PR is in good shape apart from the missing tests.

Hopefully it will get there soon.

Apart from that, I have been thinking about a solution, or a way to avoid getting into this situation:

This is what I could come up with. The Ceph CSI driver does not make use of the
CONTROLLER PUBLISH and UNPUBLISH calls, and these capabilities are not exposed by the driver.
However, because of how CSI and the Ceph CSI plugin developed over time, we ***have been*** using the external-attacher sidecar until now. Upstream now has a way to `skip attach` as part of the CSIDriver object. Since the CSIDriver functionality is GA as of Kubernetes 1.18, my proposal (https://github.com/ceph/ceph-csi/issues/1106) is to get rid of the external-attacher completely in the Ceph CSI driver implementation by making use of this field. However, this needs extensive testing in the driver, which we will take up with the upstream v3.0.0 release.
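
For reference, a minimal sketch of what such a CSIDriver object could look like, built as a Python dictionary and printed as JSON. The `attachRequired` field is the skip-attach switch in the CSIDriver spec. The driver name `rbd.csi.ceph.com` is only an example chosen for this sketch, and any other fields a real deployment needs are left out.

```python
import json

# Sketch of a CSIDriver object that tells Kubernetes to skip the attach step.
# Assumption: "rbd.csi.ceph.com" is used purely as an example driver name.
csi_driver = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "CSIDriver",
    "metadata": {"name": "rbd.csi.ceph.com"},
    "spec": {
        # False means no VolumeAttachment is created and no attacher is needed.
        "attachRequired": False,
    },
}

manifest_json = json.dumps(csi_driver, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and applied like any other manifest. Whether skipping attach is safe for a given driver still depends on the testing mentioned above.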

Comment 11 Humble Chirammal 2020-06-30 12:18:06 UTC
Test result, copied from bz#1836198

--snip--

I tested with 

https://openshift-release.apps.ci.l2s4.p1.openshiftapps.com/releasestream/4.6.0-0.nightly/release/4.6.0-0.nightly-2020-06-30-020342 

and OCS v4.4 - quay.io/rhceph-dev/ocs-olm-operator:latest-stable-4.4, and I see an improvement compared to the state from comment #1 of this BZ.

With OCS v4.6, pods start at the same speed as they did with OCP v4.3.

From the beginning, OCS was not the problematic side, since OCP v4.4 / OCP v4.5 combined with OCS v4.4 / OCS v4.5 / OCS v4.3 were all problematic.


Now, with OCP v4.6 + OCS v4.4, the result is satisfactory and pods start fine.

Start times: 

first batch of 1000 pods with PVCs: 10m 17s
second batch of 1000 pods with PVCs: 10m 27s
third batch of 1000 pods with PVCs: 10m 25s

In this test, 500 pods with PVCs were scheduled per OCP node.
--/snip--
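
To put the batch times above in perspective, the short calculation below converts each one to seconds and to an average start rate. It is plain arithmetic on the reported numbers; the helper name `to_seconds` is made up for this sketch.

```python
import re


def to_seconds(text):
    """Convert a duration such as '10m 17s' or '10m 17sec' to seconds."""
    match = re.fullmatch(r"(\d+)m\s*(\d+)s(?:ec)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


PODS_PER_BATCH = 1000
batch_times = ["10m 17s", "10m 27s", "10m 25s"]  # as reported for the three batches

seconds = [to_seconds(t) for t in batch_times]
rates = [PODS_PER_BATCH / s for s in seconds]
mean_seconds = sum(seconds) / len(seconds)

for label, secs, rate in zip(batch_times, seconds, rates):
    print(f"{label}: {secs} s, {rate:.2f} pods/s")
print(f"mean: {mean_seconds:.0f} s")
```

The three batches take 617 s, 627 s and 625 s, which is roughly 1.6 pods started per second in each case.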

Comment 13 errata-xmlrpc 2020-10-27 16:01:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

