Just to summarize, here is the test matrix with results, where OK means the slowness is NOT present:
> OCP v4.3 + OCS v4.4 - OK
> OCP v4.3 + OCS v4.3 - OK
> OCP v4.5 + OCS v4.4 - not OK
> OCP v4.4 + OCS v4.4 - not OK
> so once we switch to OCP v4.4, the issue with slow pod startup is present.
The parent bug https://bugzilla.redhat.com/show_bug.cgi?id=1836198#c39 also mentions that this is an issue for the gp2 storage class as well. That makes an issue in the CSI driver or the CSI layer less likely.
The possible issues/fixes are pointed out below:
> Does OCS use external-attacher? You may be hitting this:
> Check this comment for workarounds:
> - does this help?
> There is also PR in progress:
Waiting for upstream PR: https://github.com/kubernetes/kubernetes/pull/91307
(In reply to Jan Safranek from comment #4)
> And of course, once the PR is available, we're going to backport it all
> the way to 4.4
Thanks Jan for the update. The upstream PR is in good shape apart from the missing tests.
Hopefully it will get there soon.
Apart from that, I thought about a solution, or a way to avoid getting into this situation in the first place, and this is what I could come up with.
The Ceph CSI driver does not make use of the CONTROLLER PUBLISH and UNPUBLISH calls; these capabilities are not exposed by the driver.
However, given the history of the development of CSI and the Ceph CSI plugin, we ***were*** making use of the external-attacher sidecar until now. Upstream has a `skip attach` feature that is part of the CSIDriver object. Considering this (the CSIDriver functionality, actually) is GA with "1.18", my proposal (https://github.com/ceph/ceph-csi/issues/1106) is to get rid of the "external-attacher" completely in the Ceph CSI driver implementation by making use of this field. However, this needs extensive testing in the driver etc., which we will get to with release v3.0.0 of upstream.
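To make the `skip attach` idea concrete, here is a minimal sketch of what such a CSIDriver object could look like, built in Python and printed as JSON (which kubectl also accepts). The driver name `rbd.csi.ceph.com` and the helper `csidriver_manifest` are assumptions for illustration only; the name actually registered by the driver in an OCS deployment may differ. The only field that matters here is `spec.attachRequired`.

```python
import json


def csidriver_manifest(driver_name: str) -> dict:
    """Build a CSIDriver object that tells Kubernetes to skip the attach step.

    driver_name is an assumption for illustration; use the name the
    CSI driver actually registers with in your cluster.
    """
    return {
        "apiVersion": "storage.k8s.io/v1",  # CSIDriver is GA as of Kubernetes 1.18
        "kind": "CSIDriver",
        "metadata": {"name": driver_name},
        "spec": {
            # False means no attach operation is needed for this driver,
            # so the external-attacher sidecar is not involved.
            "attachRequired": False,
        },
    }


if __name__ == "__main__":
    # Hypothetical driver name, not taken from this bug.
    print(json.dumps(csidriver_manifest("rbd.csi.ceph.com"), indent=2))
```

With `attachRequired` set to false, Kubernetes should not wait for an attach operation for volumes of this driver, which is what would let the external-attacher be dropped.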
Test result - copied from bz#183698
I tested with
and OCS v4.4 - quay.io/rhceph-dev/ocs-olm-operator:latest-stable-4.4 and I see an improvement compared to the state from comment #1 of this BZ.
With OCP v4.6, pods start at the same speed as they did with OCP v4.3.
From the beginning, OCS was not the problematic side, as OCP v4.4 / OCP v4.5 + OCS v4.4 / OCS v4.5 / OCS v4.3 were all problematic.
Now, with OCP v4.6 + OCS v4.4, the result is satisfactory and pods are starting fine.
first batch of 1000 pods with PVCs: 10m 17s
second batch of 1000 pods with PVCs: 10m 27s
third batch of 1000 pods with PVCs: 10m 25s
In this test, 500 pods with PVCs were scheduled per OCP node.
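For a rough sense of throughput, the three batch times above can be converted to seconds and to pods started per second. This is only a small arithmetic sketch over the numbers reported here; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xm Ys' duration to whole seconds."""
    return minutes * 60 + seconds


PODS_PER_BATCH = 1000

# Batch durations as reported in this comment: 10m 17s, 10m 27s, 10m 25s.
batches = [to_seconds(10, 17), to_seconds(10, 27), to_seconds(10, 25)]

# Pods started per second for each batch.
rates = [PODS_PER_BATCH / duration for duration in batches]

mean_duration = sum(batches) / len(batches)

if __name__ == "__main__":
    for i, (duration, rate) in enumerate(zip(batches, rates), start=1):
        print(f"batch {i}: {duration}s, {rate:.2f} pods/s")
    print(f"mean batch duration: {mean_duration:.0f}s")
```

The three batches land within about 10 seconds of each other, so the start rate stays roughly constant across batches.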
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.