Description of problem (please be as detailed as possible and provide log snippets):

Red Hat deployed an OCS 4.8 cluster. After the initial deployment, "FlexibleAutoscaling: true" needed to be set on the StorageCluster in the openshift-storage namespace, which required a redeployment. After this redeployment, PVCs for ocs-storagecluster-cephfs no longer work and stay in Pending state instead of reaching Bound.

Version of all relevant components (if applicable):
OCS 4.8; the same issue has been reproduced on OCS 4.7.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This issue had an impact for multiple days. PVCs for ocs-storagecluster-cephfs were no longer working.

Is there any workaround available to the best of your knowledge?
The following rook GitHub issue describes the situation: https://github.com/rook/rook/issues/6183

The following comment from that issue gave us the clue and the solution:

Muyan0828 commented on Mar 4, 2021:
"I think the root case is that your filesystem was recreate, cephfs-csi stores a bool variable in memory for each cluster to mark whether a subvolumegroup has been created, cephfs-csi corresponds to a cluster via the clusterID field in StorageClass, which is set to the namespace of the cephcluster in rook, so when the CephFilesystem is rebuilt in the same namespace and StorageClass is recreated, in cephfs-csi the variable for the cluster is true, so there is no attempt to recreate the subvolumegroup"

So the following was tried in a rook-ceph-operator pod:

ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem
[]

This is not the same result as on a working OCS cluster. We expected the following result:

ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem
[
    {
        "name": "_deleting"
    },
    {
        "name": "csi"
    }
]

Restarting the following pods solves the situation:

oc delete pod -l app=csi-cephfsplugin-provisioner
oc delete pod -l app=csi-cephfsplugin

At this point PVCs can be Bound.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
This issue has been reproduced with different scenarios on OCS 4.7:

Removing the "cephfs" filesystem manually using the following procedure (not recommended; done for debugging purposes only):

ceph fs volume rm ocs-storagecluster-cephfilesystem --yes-i-really-mean-it
ceph fs volume create ocs-storagecluster-cephfilesystem

Or removing the StorageCluster in openshift-storage and recreating it also produces the issue.

Can this issue be reproduced from the UI?
Not to my knowledge.

If this is a regression, please provide more details to justify this:
This is not a regression from my understanding; it is more of a behaviour. The operator in the "openshift-storage" namespace might need to be removed to avoid falling into the same scenario.

Steps to Reproduce:
The following procedure was used: https://gitlab.consulting.redhat.com/emea-telco-practice/telefonica-next/mano-cluster/-/tree/maindc1/05.OCS

====
oc apply -f 1.LSO.yml
oc apply -f 2.labelNodes.yaml
oc apply -f 3.localVolumeSet.yml  ---------------> LSO Operator and local volume are provisioned.
oc apply -f 4.OCS.yml             ---------------> OCS Operator deployed.
oc apply -f 5.storageCluster.yml  ---------------> OCS cluster deployed.

At this exact step everything was working as expected. Then the StorageCluster in the openshift-storage namespace was removed:

oc delete -f 5.storageCluster.yml ---------------> OCS cluster deleted, BUT the LSO and OCS operators remain installed and running.
vi 5.storageCluster.yml           ---------------> Enabled FlexibleAutoscaling: true
oc apply -f 5.storageCluster.yml  ---------------> Redeployed the OCS cluster with FlexibleAutoscaling set to true.

Actual results:
The PVC for ocs-storagecluster-cephfs stays in Pending state and is never bound:

oc get pvc
NAME    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                AGE
mypvc   Pending                                      ocs-storagecluster-cephfs   10s

oc logs -n openshift-storage csi-cephfsplugin-provisioner-544d647bc4-svtwz csi-provisioner | less
I0902 13:05:32.877114       1 event.go:282] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"test", Name:"mypvc", UID:"7e2a57bc-ee1f-49d9-8951-66fdaa2a4f74", APIVersion:"v1", ResourceVersion:"359756", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Internal desc = rados: ret=-2, No such file or directory: "subvolume group 'csi' does not exist"

Expected results:
There is no strict expected result here. I can say that I expect the PVC to be Bound, but this case is more a description of what has been discovered and discussed. This is a behaviour of OCS that needs to be better described in the OCS documentation.

Additional info:
We expect the following feedback from engineering:
- Which CRD should have been removed to be able to configure "FlexibleAutoscaling: true" without falling into this trap?
- Does the workaround look safe and accurate? For now we do not see anything that leads to any issue on the PVC side.
- Do you think we might be missing something obvious here regarding the supportability of the filesystem or of other services in the future?
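For diagnosis, the following sketch summarizes how the missing "csi" subvolumegroup can be confirmed, plus a possible manual recovery step. Only the pod restarts described above were actually used as the workaround; the manual "subvolumegroup create" step is an assumption and was not validated here.

# Confirm which clusterID the cephfs StorageClass points to (it should be the
# namespace of the CephCluster, i.e. openshift-storage).
oc get sc ocs-storagecluster-cephfs -o jsonpath='{.parameters.clusterID}{"\n"}'

# From the rook-ceph-operator (or toolbox) pod, check whether the "csi"
# subvolumegroup exists on the filesystem.
ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem

# Assumed alternative to restarting the CSI pods: recreate the subvolumegroup
# manually, then retry the PVC (not validated in this report).
ceph fs subvolumegroup create ocs-storagecluster-cephfilesystem csi

# Watch the provisioner logs for the "subvolume group 'csi' does not exist"
# error to confirm whether provisioning recovers.
oc -n openshift-storage logs -l app=csi-cephfsplugin-provisioner -c csi-provisioner --tail=50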
Changes have been synced to the downstream devel and release-4.12 branches. ODF 4.12 is expected to include a fix for this bug.
Tested on ODF version 4.12.0-91 following the steps in comment #13.

Flow:

# List subvolumes
sh-4.4$ ceph fs subvolume ls ocs-storagecluster-cephfilesystem csi
[
    {
        "name": "csi-vol-7c13ed73-9cb3-4572-841d-d44070fa82d1"
    }
]

# Remove subvolume & subvolumegroup
sh-4.4$ ceph fs subvolume rm ocs-storagecluster-cephfilesystem csi-vol-7c13ed73-9cb3-4572-841d-d44070fa82d1 csi
sh-4.4$ ceph fs subvolumegroup rm ocs-storagecluster-cephfilesystem csi

# List subvolumegroups
sh-4.4$ ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem
[]

# Create PVC
[ybenshim@localhost tmp]$ kubectl create -f pvc.yaml
persistentvolumeclaim/cephfs-pvc created

[ybenshim@localhost tmp]$ kubectl get pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
cephfs-pvc                    Bound    pvc-978f4174-79ce-435f-b8ae-2fa794067c57   1Gi        RWO            ocs-storagecluster-cephfs     3s
db-noobaa-db-pg-0             Bound    pvc-462b678f-d2b5-43fb-be74-e992a5d9c194   50Gi       RWO            ocs-storagecluster-ceph-rbd   31h
ocs-deviceset-0-data-0lm4lh   Bound    pvc-6ce4900b-6c5b-470e-875d-cda2d29b3f9a   100Gi      RWO            thin                          31h
ocs-deviceset-1-data-0h4khv   Bound    pvc-9dce2a0c-791f-4ecd-9ffb-0461c66941a9   100Gi      RWO            thin                          31h
ocs-deviceset-2-data-06ff4f   Bound    pvc-58027f41-4345-45d5-a9fc-b9d4e06c4798   100Gi      RWO            thin                          31h
rook-ceph-mon-a               Bound    pvc-7467daea-de4d-4a59-811d-53eb723d627d   50Gi       RWO            thin                          31h
rook-ceph-mon-b               Bound    pvc-f8629fe3-31eb-4aa8-8347-a2871b540f7c   50Gi       RWO            thin                          31h
rook-ceph-mon-c               Bound    pvc-4ddb2f12-a271-44ee-b68a-e69a0ac73889   50Gi       RWO            thin                          31h

# List subvolumegroups
sh-4.4$ ceph fs subvolumegroup ls ocs-storagecluster-cephfilesystem
[
    {
        "name": "csi"
    }
]

After the subvolumegroup deletion, the PVC was created and reached Bound state.

Moving to VERIFIED.
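For completeness, below is a minimal sketch of the pvc.yaml used in the flow above; the actual manifest is not attached here, so the name, size and access mode are assumed from the resulting "cephfs-pvc" shown in the output.

# Assumed content of pvc.yaml, applied through a heredoc for convenience.
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
EOF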