@Karthick, do we see the same in our tests?
@Elvir, is this also showing on OCP 4.4?
Thanks Elvir. Based on this info, marking as a regression.
Elvir, let me understand some more here:

Does the table below reflect your testing results?

OCS 4.3 + OCP 4.3 -> works
OCS 4.3 + OCP 4.4 -> fail
OCS 4.4 + OCP 4.4 -> fail

Also, what is the result of OCS 4.4 + OCP 4.3?

Are all the tests performed on the same kind of setup, i.e. same cloud provider, number of nodes, spec, etc.?

When you say the pods are stuck at the `ContainerCreating` stage, are the PVCs also being created when the pod comes up, or are the PVCs created upfront and the pod just uses them to attach the volumes?
(In reply to Humble Chirammal from comment #11)
> Elvir, let me understand some more here:
>
> Does the table below reflect your testing results?
>
> OCS 4.3 + OCP 4.3 -> works
> OCS 4.3 + OCP 4.4 -> fail
> OCS 4.4 + OCP 4.4 -> fail

I have tested so far:

OCP v4.5 + OCS v4.4 - fail
OCP v4.4 + OCS v4.4 - fail
OCP v4.3 + OCS v4.3 - works

I have not tested "OCS 4.3 + OCP 4.4" yet.

> Also, what is the result of OCS 4.4 + OCP 4.3?

Waiting on quay.io to be operational in order to test. I expect to have this information later today.

> Are all the tests performed on the same kind of setup, i.e. same cloud provider,
> number of nodes, spec, etc.?

Yes: AWS - m5.4xlarge - 6 OCS nodes, 12 app nodes - same environment.

> When you say the pods are stuck at the `ContainerCreating` stage, are the PVCs
> also being created when the pod comes up, or are the PVCs created upfront and
> the pod just uses them to attach the volumes?

The PVCs are created - I can see the desired number of PVs/PVCs in "oc get pv/pvc", and I can also see that this number is equal to the number I get on the Ceph side with

# rbd ls example-storagecluster-cephblockpool | wc -l

i.e. if I request 2000 PVCs, then "# rbd ls example-storagecluster-cephblockpool | wc -l" will show 2000 on the Ceph side.

The problem is that pods for the above configurations stay very long in the "ContainerCreating" state (the pod images are already present on the machines, so there is no issue from that side), and the second batch of 1000 pods starts only after 90 minutes or longer.

From a single pod's events we see:

Normal   Scheduled    <unknown>            default-scheduler  Successfully assigned pvcscalea0/pvc-pod-qj9kb to ip-10-0-142-115.eu-central-1.compute.internal
Warning  FailedMount  23m                  kubelet, ip-10-0-142-115.eu-central-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[persistentvolume], unattached volumes=[default-token-wh87w persistentvolume]: timed out waiting for the condition
Warning  FailedMount  4m59s (x9 over 25m)  kubelet, ip-10-0-142-115.eu-central-1.compute.internal  Unable to attach or mount volumes: unmounted volumes=[persistentvolume], unattached volumes=[persistentvolume default-token-wh87w]: timed out waiting for the condition
Normal   Pulled       4m45s                kubelet, ip-10-0-142-115.eu-central-1.compute.internal  Container image "quay.io/openshift-scale/scale-ci-workload" already present on machine
Normal   Created      4m44s                kubelet, ip-10-0-142-115.eu-central-1.compute.internal  Created container fio-pod
Normal   Started      4m44s                kubelet, ip-10-0-142-115.eu-central-1.compute.internal  Started container fio-pod

So here the pod waits roughly 19 minutes between the two "timed out waiting for the condition" error messages.

Will update this bz later with the test result for OCP v4.3 + OCS v4.4.
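[Editorial note] A minimal shell sketch of this kind of cross-check between the Kubernetes and Ceph sides; the pool name and the pvcscalea0 namespace are taken from this report, and the exact commands are an illustration rather than the test harness that was actually used:

# Bound PVCs across the cluster (provisioning side)
oc get pvc --all-namespaces --no-headers | awk '$3 == "Bound"' | wc -l

# RBD images backing them on the Ceph side (run from the rook-ceph-tools pod)
rbd ls example-storagecluster-cephblockpool | wc -l

# Pods still stuck on the attach/mount side
oc get pods -n pvcscalea0 --no-headers | grep -c ContainerCreating

If the first two counts match but the third stays high, provisioning has finished and the delay is in attach/mount, which is the pattern described above.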
(In reply to Elvir Kuric from comment #12)
> (In reply to Humble Chirammal from comment #11)
> > Elvir, let me understand some more here:
> >
> > Does the table below reflect your testing results?
> >
> > OCS 4.3 + OCP 4.3 -> works
> > OCS 4.3 + OCP 4.4 -> fail
> > OCS 4.4 + OCP 4.4 -> fail
>
> I have tested so far:
>
> OCP v4.5 + OCS v4.4 - fail
> OCP v4.4 + OCS v4.4 - fail
> OCP v4.3 + OCS v4.3 - works
>
> I have not tested "OCS 4.3 + OCP 4.4" yet.

Sure, I was trying to isolate the component here.

> > Also, what is the result of OCS 4.4 + OCP 4.3?
>
> Waiting on quay.io to be operational in order to test. I expect to have this
> information later today.

Sure, that should help us!

> > Are all the tests performed on the same kind of setup, i.e. same cloud provider,
> > number of nodes, spec, etc.?
>
> Yes: AWS - m5.4xlarge - 6 OCS nodes, 12 app nodes - same environment.
>
> > When you say the pods are stuck at the `ContainerCreating` stage, are the PVCs
> > also being created when the pod comes up, or are the PVCs created upfront and
> > the pod just uses them to attach the volumes?
>
> The PVCs are created - I can see the desired number of PVs/PVCs in "oc get pv/pvc",
> and I can also see that this number is equal to the number I get on the Ceph side with
>
> # rbd ls example-storagecluster-cephblockpool | wc -l
>
> The problem is that pods for the above configurations stay very long in the
> "ContainerCreating" state (the pod images are already present on the machines,
> so there is no issue from that side), and the second batch of 1000 pods starts
> only after 90 minutes or longer.
>
> [pod event log snipped; see comment #12]
>
> So here the pod waits roughly 19 minutes between the two "timed out waiting
> for the condition" error messages.

Thanks, that is exactly why I was requesting this information: I wanted to check whether the delay is in provisioning or in attaching. From your reply, it looks like it is on the attach side and not on the provisioning side.

> Will update this bz later with the test result for OCP v4.3 + OCS v4.4.

Sure.
[Side note] I was going through the must-gather and the other log directory mentioned in c#4; unfortunately, I could not locate the logs I need. There are many logs, but I don't see any belonging to CSI. Can you re-check whether the collection has the CSI pod logs pertaining to this test?
(In reply to Raz Tamir from comment #10)
> Thanks Elvir. Based on this info, marking as a regression.

I don't think this qualifies as a regression. A few things to summarize:

1) If I'm correct, 1500 pods per 3 nodes is the supported/qualified limit (maybe this is documented too). But whether we scale this limit with each additional 3-node iteration is still a question for me. Say, with 12 nodes, are we saying we could support 6000? With 24 nodes, 12000?

2) Also, this is not about pods failing to come up; rather, it is about a **delay** in pods getting spawned. If that had been a concern in previous releases, we would have stated that the pods will be up and running within 'X' amount of time. Drawing such a conclusion is genuinely difficult, as it depends on many other factors.
The matrix Elvir gave above makes it clear it's a regression. As for whether it's a reason to block the release - surely it isn't, for two reasons:
1. We believe it's an edge case - at the high end of the scale.
2. It is, as you've commented, just a delay, not a clear failure.

(I'm still very interested to understand where it's coming from.)

Michael, thoughts?
Please describe your VMware setup. Are you sure the hypervisors, NICs or storage are not overloaded?
(In reply to Yaniv Kaul from comment #16)
> The matrix Elvir gave above makes it clear it's a regression. As for whether
> it's a reason to block the release - surely it isn't, for two reasons:
> 1. We believe it's an edge case - at the high end of the scale.
> 2. It is, as you've commented, just a delay, not a clear failure.
>
> (I'm still very interested to understand where it's coming from.)

I am almost sure this is coming from either the OCP sidecars or from the kubelet/controller shipped with OCP 4.4. The reason for this assumption is that there is NO code change between OCS 4.3 and OCS 4.4 with respect to the Ceph CSI builds.
Elvir, while testing these different combinations, does the base/node software remain the same, e.g. kernel or OS release?
[Action Plan]

1) We have to capture the logs and analyze them at least once. Unfortunately, neither must-gather (c#22 and c#3) has the logs in it. Not sure why they are missing at the moment.

2) Cross-check with the OCP team whether there are any changes that could cause this.
Also, to narrow it down, what exactly is slower than before?
- PVC creation?
- Pod creation?
- PVC attaching / mounting?

Can you create 1000 PVCs without pods/mounting? Etc.
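[Editorial note] A rough sketch of the kind of isolation test suggested above: provision PVCs only, with no pods, so that only the CSI provisioning path is exercised. The namespace and storage class names here are assumptions; substitute whatever the scale test actually uses:

oc new-project pvc-scale-test    # hypothetical namespace for the isolation test

for i in $(seq 1 1000); do
  cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scale-pvc-$i
  namespace: pvc-scale-test
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-rbd    # assumed RBD storage class name
EOF
done

# Watch how many PVCs are still not Bound; if this drains quickly with no pods
# involved, provisioning is fine and the slowness is on the attach/mount side.
oc get pvc -n pvc-scale-test --no-headers | awk '$2 != "Bound"' | wc -l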
Does OCS use the external-attacher? You may be hitting this:
https://github.com/kubernetes/kubernetes/issues/84169

Check this comment for workarounds:
https://github.com/kubernetes/kubernetes/issues/84169#issuecomment-545692734
- does this help?

There is also a PR in progress:
https://github.com/kubernetes/kubernetes/pull/91307
(In reply to Jan Safranek from comment #35)
> Does OCS use the external-attacher? You may be hitting this:
> https://github.com/kubernetes/kubernetes/issues/84169
>
> Check this comment for workarounds:
> https://github.com/kubernetes/kubernetes/issues/84169#issuecomment-545692734
> - does this help?
>
> There is also a PR in progress:
> https://github.com/kubernetes/kubernetes/pull/91307

Thanks Jan! It looks like we are hitting the same issue. I think the workaround is setting the kube-controller-manager flag disable-attach-detach-reconcile-sync to true.

Elvir, can we start the controller-manager with the above flag set to true and check the result?
// Additional note: Adjusting --worker-threads and --reconcile-sync on the sidecars is one of the things we could try: https://github.com/ceph/ceph-csi/blob/master/deploy/cephfs/kubernetes/csi-cephfsplugin-provisioner.yaml#L70

Also just noticed https://github.com/rook/rook/issues/5366, which claims that setting `disable-attach-detach-reconcile-sync` to true fixed the issue for them.

One remaining question is why this issue is not visible with the OCP 4.3 combination. AFAICT, the controller-manager code in this area is the same even in OCP 4.3.
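[Editorial note] For illustration, this is roughly what tuning those options on the csi-attacher sidecar could look like. The values are examples only, not recommendations; the flags are the external-attacher's --worker-threads and --reconcile-sync, and the image tag and socket path shown here are assumptions:

- name: csi-attacher
  image: quay.io/k8scsi/csi-attacher:v2.1.0        # illustrative image tag
  args:
    - "--v=5"
    - "--csi-address=$(ADDRESS)"
    - "--leader-election=true"
    # number of VolumeAttachment objects the sidecar processes concurrently
    - "--worker-threads=25"
    # how often the attacher re-syncs VolumeAttachment state; a longer interval
    # reduces the periodic re-sync load at high volume counts
    - "--reconcile-sync=5m"
  env:
    - name: ADDRESS
      value: /csi/csi-provisioner.sock             # assumed socket path
  volumeMounts:
    - name: socket-dir
      mountPath: /csi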
I have to figure out how to set disable-attach-detach-reconcile-sync to true. I will update the bz once I have tested with it.
Elvir, I have applied this configuration change on your system; can you please retest?
I did some more tests. With 1000 pods in the "ContainerCreating" state and none of them Running, the following was done:

# oc get cm config -n openshift-kube-controller-manager -o yaml > controllermanager.yml

add disable-attach-detach-reconcile-sync: "true" to the configuration and apply the change:

# oc apply -f controllermanager.yml

After this, pods start to move into the Running state once the controller-manager pods are restarted following the change above.

So the pods are starting, but some things are still problematic:

- If the user requests 1000 pods, none of the pods will be started and marked "Ready" with their PVC attached before all pods are created. Only after all pods are created do they start gradually moving to the Ready state. Previously, pods moved to the "Ready" state while other pods were still being created.

- Pods are starting more slowly than with OCP v4.3 / OCS v4.3, and the start time for pods with a PVC attached is bad.

Also, how can this change be made permanent so that we can do more extensive tests?
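[Editorial note] On OCP 4, controller-manager arguments are usually made persistent through the KubeControllerManager operator CR rather than the config ConfigMap, which the operator can reconcile back to its managed state. The following is a hedged sketch only and should be confirmed with the OCP team before use; unsupportedConfigOverrides is, as the name says, unsupported:

oc patch kubecontrollermanager cluster --type=merge -p '
{
  "spec": {
    "unsupportedConfigOverrides": {
      "extendedArguments": {
        "disable-attach-detach-reconcile-sync": ["true"]
      }
    }
  }
}'
# The operator then rolls out new kube-controller-manager static pods with the flag applied.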
Thanks Elvir for the detailed testing and for sharing the result (c#42). Sure, as this is not specific to OCS and is mostly on the controller side, we can work with the OCP team; I will clone this bz and proceed.

ps# I am also experimenting with some of the sidecar options mentioned in c#37. I will update here if I get any positive results with those.
Elad, Raz, considering the debugging/isolation done so far, can we remove the regression/blocker flags, etc.?
(In reply to Humble Chirammal from comment #45)
> Elad, Raz, considering the debugging/isolation done so far, can we remove the
> regression/blocker flags, etc.?

This cannot even be a question any more:

* it is NOT an OCS regression.
* it *IS* an OCP regression.
* upstream issue: https://github.com/kubernetes/kubernetes/issues/84169

==> We need to:

* remove the blocker flag here
  ==> DONE, and updated the subject.

* find or create a bug in OCP to track this and have this one depend on the OCP bug
  ==> Can someone please take this up?
resetting assignees
OK, the bot is re-adding the blocker flag because of the regression keyword. Removing the regression tag to be able to remove the blocker...
(In reply to Michael Adam from comment #48)
> (In reply to Humble Chirammal from comment #45)
>
> * find or create a bug in OCP to track this and have this one depend on the OCP bug
>   ==> Can someone please take this up?

There is already an OCP bug which I created to track it; it is available here:
https://bugzilla.redhat.com/show_bug.cgi?id=1839933
Thinking about the solution, or how to avoid getting into this situation in the first place, this is what I could come up with.

The Ceph CSI driver does not implement the CONTROLLER PUBLISH and UNPUBLISH calls; these capabilities are not exposed by the driver. However, due to the history of the development of CSI and the Ceph CSI plugin, we ***were*** making use of the external-attacher sidecar until now. Upstream now has a solution for this: the `skip attach` feature, which is part of the CSIDriver object. Considering this went GA with Kubernetes 1.18, my proposal (https://github.com/ceph/ceph-csi/issues/1106) is to get rid of the external-attacher completely in the Ceph CSI driver implementation by making use of this field. However, this needs extensive testing in the driver, etc., which we will take up with release v3.0.0 upstream.
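[Editorial note] For context, a sketch of what the `skip attach` setting looks like on the CSIDriver object (GA in storage.k8s.io/v1 with Kubernetes 1.18). The driver name below is an assumption, and whether ceph-csi/rook should actually ship this is exactly what the linked issue discusses:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: openshift-storage.rbd.csi.ceph.com   # assumed driver name; use the name the driver registers with
spec:
  # With attachRequired: false, Kubernetes skips the attach step entirely:
  # no VolumeAttachment objects are created, so no external-attacher is needed.
  attachRequired: false
  podInfoOnMount: false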
Considering this is not a regression on the OCS end and it is being tracked for OCP, I would like to consider this for OCS 4.5. This is also about scaling to a very large set and is purely about performance.
Removing the needinfo as it was already decided that it's not a blocker for 4.4. Sorry for the late reply.
On the OCS side, this still falls under ceph-csi, hence changing the component field. Looking at the discussion in https://github.com/ceph/ceph-csi/issues/1106, I don't think we are going to disable the external-attacher soon. Doesn't look like a 4.5 candidate to me. @Humble, any comments?
fix direction of dependency
(In reply to Humble Chirammal from comment #53)
> The Ceph CSI driver does not implement the CONTROLLER PUBLISH and UNPUBLISH
> calls; these capabilities are not exposed by the driver. However, due to the
> history of the development of CSI and the Ceph CSI plugin, we ***were*** making
> use of the external-attacher sidecar until now. Upstream now has a solution for
> this: the `skip attach` feature, which is part of the CSIDriver object.
> Considering this went GA with Kubernetes 1.18, my proposal
> (https://github.com/ceph/ceph-csi/issues/1106) is to get rid of the
> external-attacher completely in the Ceph CSI driver implementation by making
> use of this field. However, this needs extensive testing in the driver, etc.,
> which we will take up with release v3.0.0 upstream.

Based on https://github.com/ceph/ceph-csi/issues/1106#issuecomment-642252773, removing the attacher is discouraged. Let's use this bug as a tracker for the OCP bug in the meantime.

The OCP BZ is targeted for 4.6.
==> Moving this BZ out to OCS 4.6 too.
This one can move to ON_QA - the tracked BZ is already verified (for OCP 4.6).
The 4.5 one is still on POST - https://bugzilla.redhat.com/show_bug.cgi?id=1854311
The 4.4 one is still on ASSIGNED - https://bugzilla.redhat.com/show_bug.cgi?id=1854312
(In reply to Yaniv Kaul from comment #67)
> This one can move to ON_QA - the tracked BZ is already verified (for OCP 4.6).
> The 4.5 one is still on POST - https://bugzilla.redhat.com/show_bug.cgi?id=1854311
> The 4.4 one is still on ASSIGNED - https://bugzilla.redhat.com/show_bug.cgi?id=1854312

Hmm... ideally this should be moved to ON_QA only when the OCP 4.5 and OCP 4.4 bugs are ON_QA. The reason being: even though this is a tracker for OCS 4.6, the validation test bed can be OCP 4.5, OCP 4.4, etc., where this will fail unless the fixed build is available.
(In reply to Humble Chirammal from comment #68)
> (In reply to Yaniv Kaul from comment #67)
> > This one can move to ON_QA - the tracked BZ is already verified (for OCP 4.6).
> > The 4.5 one is still on POST - https://bugzilla.redhat.com/show_bug.cgi?id=1854311
> > The 4.4 one is still on ASSIGNED - https://bugzilla.redhat.com/show_bug.cgi?id=1854312
>
> Hmm... ideally this should be moved to ON_QA only when the OCP 4.5 and OCP 4.4
> bugs are ON_QA. The reason being: even though this is a tracker for OCS 4.6,
> the validation test bed can be OCP 4.5, OCP 4.4, etc., where this will fail
> unless the fixed build is available.

If that's a tracker for ALL the above bugs, we can postpone this move. If we want QE to test the functionality, it makes more sense to split.
The respective OCP bz https://bugzilla.redhat.com/show_bug.cgi?id=1854311 is already in ON_QA.

Update from QE: we were able to create more than 1200 pods without any issues; this was tested on an independent-mode cluster. Note: earlier we observed the PVC attach problem with just 300+ pods.

OCP Version: 4.5.0-0.nightly-2020-07-22-074214
OCS Version: ocs-operator.v4.5.0-494.ci

@Elvir, have you tried testing on OCP 4.5 with the above build or the latest? Any updates?
(In reply to Ramakrishnan Periyasamy from comment #70)
> The respective OCP bz https://bugzilla.redhat.com/show_bug.cgi?id=1854311 is
> already in ON_QA.
>
> Update from QE: we were able to create more than 1200 pods without any issues;
> this was tested on an independent-mode cluster. Note: earlier we observed the
> PVC attach problem with just 300+ pods.
>
> OCP Version: 4.5.0-0.nightly-2020-07-22-074214
> OCS Version: ocs-operator.v4.5.0-494.ci
>
> @Elvir, have you tried testing on OCP 4.5 with the above build or the latest?
> Any updates?

Yes, with OCP v4.6 it worked fine: https://bugzilla.redhat.com/show_bug.cgi?id=1836198#c64
*** Bug 1862196 has been marked as a duplicate of this bug. ***
*** Bug 1861438 has been marked as a duplicate of this bug. ***
This is already ON_QA, but we don't have qa ack?
(In reply to Mudit Agarwal from comment #80)
> This is already ON_QA, but we don't have qa ack?

QE ack was available, but it looks like it was dropped during some of the changes. Raz, can you bring back the QE ack here?
As this was fixed in OCP 4.6 and backported to 4.5.z and 4.4.z, should we also create a tracker for OCS 4.4.z and keep this one for 4.5?
(In reply to Elad from comment #82)
> As this was fixed in OCP 4.6 and backported to 4.5.z and 4.4.z, should we
> also create a tracker for OCS 4.4.z and keep this one for 4.5?

If I am not wrong, this is just an OCP tracker and there is no change from the OCS side. So it is up to QE if you want a 4.4.z BZ to track testing; from the dev side I don't see a need to open a 4.4.z BZ.

Also, qa_ack is still pending. Can you please provide it? I don't have permissions.