Bug 2264051 - Clone PVC with different access mode - all mgr pods in CLBO, cloned PVC never reach Bound
Keywords:
Status: CLOSED DUPLICATE of bug 2258357
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Malay Kumar parida
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-13 15:48 UTC by Daniel Osypenko
Modified: 2024-03-05 07:43 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-02-22 13:33:34 UTC
Embargoed:
khiremat: needinfo-



Description Daniel Osypenko 2024-02-13 15:48:00 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Running the test test_clone_with_different_access_mode, which creates 9 clones of an existing PVC with an access mode different from the parent PVC's, drives the mgr pods into a CLBO (CrashLoopBackOff) state.

Deployment: Downstream-OCP4-15-VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK-3M-3W

The clone created with the YAML below is stuck in the Pending state:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clone-pvc-test-47913a39ff9548-bec6e61beb
  namespace: namespace-test-b0ccbc497af947f7981fb8937
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-test-47913a39ff95488caaf4418f7beae23
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
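
For reference, a minimal sketch of applying this manifest and watching the clone's status (the clone-pvc.yaml filename is illustrative):

oc apply -f clone-pvc.yaml
# watch the clone; it should move from Pending to Bound
oc get pvc clone-pvc-test-47913a39ff9548-bec6e61beb -n namespace-test-b0ccbc497af947f7981fb8937 -w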


2024-01-03 21:30:12  Name:          clone-pvc-test-47913a39ff9548-bec6e61beb
2024-01-03 21:30:12  Namespace:     namespace-test-b0ccbc497af947f7981fb8937
2024-01-03 21:30:12  StorageClass:  ocs-storagecluster-cephfs
2024-01-03 21:30:12  Status:        Pending
2024-01-03 21:30:12  Volume:        
2024-01-03 21:30:12  Labels:        <none>
2024-01-03 21:30:12  Annotations:   volume.beta.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2024-01-03 21:30:12                 volume.kubernetes.io/storage-provisioner: openshift-storage.cephfs.csi.ceph.com
2024-01-03 21:30:12  Finalizers:    [kubernetes.io/pvc-protection]
2024-01-03 21:30:12  Capacity:      
2024-01-03 21:30:12  Access Modes:  
2024-01-03 21:30:12  VolumeMode:    Filesystem
2024-01-03 21:30:12  DataSource:
2024-01-03 21:30:12    Kind:   PersistentVolumeClaim
2024-01-03 21:30:12    Name:   pvc-test-47913a39ff95488caaf4418f7beae23
2024-01-03 21:30:12  Used By:  <none>
2024-01-03 21:30:12  Events:
2024-01-03 21:30:12    Type     Reason                Age                  From                                                                                                                     Message
2024-01-03 21:30:12    ----     ------                ----                 ----                                                                                                                     -------
2024-01-03 21:30:12    Warning  ProvisioningFailed    3m32s                openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-cc6b5b547-nkxj2_e0a815a7-e343-44db-ab00-151a5a1712d8  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is pending
2024-01-03 21:30:12    Normal   Provisioning          56s (x9 over 3m32s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-cc6b5b547-nkxj2_e0a815a7-e343-44db-ab00-151a5a1712d8  External provisioner is provisioning volume for claim "namespace-test-b0ccbc497af947f7981fb8937/clone-pvc-test-47913a39ff9548-bec6e61beb"
2024-01-03 21:30:12    Warning  ProvisioningFailed    40s (x8 over 3m31s)  openshift-storage.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-cc6b5b547-nkxj2_e0a815a7-e343-44db-ab00-151a5a1712d8  failed to provision volume with StorageClass "ocs-storagecluster-cephfs": rpc error: code = Aborted desc = clone from snapshot is already in progress
2024-01-03 21:30:12    Normal   ExternalProvisioning  5s (x16 over 3m32s)  persistentvolume-controller                                                                                              Waiting for a volume to be created either by the external provisioner 'openshift-storage.cephfs.csi.ceph.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
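
The events above come from the CSI provisioner; a sketch of the commands used to collect this kind of output (pod name taken from the events above, container name assumed from the standard Rook CSI deployment):

oc describe pvc clone-pvc-test-47913a39ff9548-bec6e61beb -n namespace-test-b0ccbc497af947f7981fb8937
# external-provisioner sidecar logs show the "clone from snapshot is pending/in progress" errors
oc logs -n openshift-storage csi-cephfsplugin-provisioner-cc6b5b547-nkxj2 -c csi-provisioner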


Alerts captured:
state': 'firing', 'activeAt': '2024-01-03T17:05:30.245594999Z', 'value': '2.102324499988556e+04'},
{'labels': {'alertname': 'KubePodCrashLooping', 'container': 'mgr', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-b-7f5c74fb5d-jlc52', 'reason': 'CrashLoopBackOff', 'service': 'kube-state-metrics', 'severity': 'warning', 'uid': '82dcdb73-34ff-423e-aae0-161827947181'}, 'annotations': {'description': 'Pod openshift-storage/rook-ceph-mgr-b-7f5c74fb5d-jlc52 (mgr) is in waiting state (reason: "CrashLoopBackOff").', 'summary': 'Pod is crash looping.'}, 'state': 'firing', 'activeAt': '2024-01-03T19:29:22.487572828Z', 'value': '1e+00'},
{'labels': {'alertname': 'KubePodCrashLooping', 'container': 'mgr', 'endpoint': 'https-main', 'job': 'kube-state-metrics', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-5448cf4785-q68wp', 'reason': 'CrashLoopBackOff', 'service': 'kube-state-metrics', 'severity': 'warning', 'uid': 'b58946ed-10b4-44ca-993d-6ed882173642'}, 'annotations': {'description': 'Pod openshift-storage/rook-ceph-mgr-a-5448cf4785-q68wp (mgr) is in waiting state (reason: "CrashLoopBackOff").', 'summary': 'Pod is crash looping.'}
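
A quick sketch for confirming the mgr CLBO state and pulling the crashed container's logs (the app=rook-ceph-mgr label is assumed from standard Rook naming; the pod name is from the alert above):

oc get pods -n openshift-storage -l app=rook-ceph-mgr
# logs of the previous (crashed) mgr container
oc logs -n openshift-storage rook-ceph-mgr-b-7f5c74fb5d-jlc52 -c mgr --previous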

test log - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-061vue1cslv33-uba/j-061vue1cslv33-uba_20240103T120010/logs/ocs-ci-logs-1704292772/by_outcome/failed/tests/functional/pv/pvc_clone/test_clone_with_different_access_mode.py/TestCloneWithDifferentAccessMode/test_clone_with_different_access_mode/logs

Version of all relevant components (if applicable):
OCS 4.15.0-100
OCP 4.15.0-0.nightly-2024-01-03-015912

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
2/2

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:
The test never failed on ODF 4.14 but fails on ODF 4.15.

Steps to Reproduce:
1. Create CephFileSystem and CephBlockPool PVCs with different volume modes and access modes
2. Attach each PVC to a fio pod and run IO to fill 1 Gi of each 3 Gi PVC
3. Clone each PVC with an access mode different from the original PVC's, then verify each cloned PVC's status (see the sketch below)
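
A minimal sketch of steps 2-3 for one PVC, assuming a parent PVC named parent-pvc in namespace test-ns (all names and the fio image are illustrative; the actual test creates 9 clones):

cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fio-pod
  namespace: test-ns
spec:
  containers:
  - name: fio
    image: quay.io/example/fio:latest  # illustrative image with fio installed
    command: ["fio", "--name=fill", "--filename=/mnt/data/testfile", "--size=1G", "--rw=write"]
    volumeMounts:
    - name: data
      mountPath: /mnt/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: parent-pvc
EOF
# then clone parent-pvc with a different access mode (manifest as in the description)
# and verify the clone's status: oc get pvc -n test-ns -w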


Actual results:
PVCs are stuck in Pending; mgr pods are in CLBO.

Expected results:
Each cloned PVC reaches Bound.

Additional info:
must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-061vue1cslv33-uba/j-061vue1cslv33-uba_20240103T120010/logs/testcases_1704292772/j-061vue1cslv33-u/

Two similar failures on the same deployment on ODF 4.15
(Downstream-OCP4-15-VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK-3M-3W);
the test never failed on ODF 4.14.

Comment 3 Daniel Osypenko 2024-02-14 09:20:20 UTC
> Do we have other tests just creating multiple parallel cephfs clones in a similar way, and are those tests passing?
I don't see such tests, but I can add more to the picture. From the observed test history: we had 19 runs pass on multiple different platforms on ODF 4.15, and all 14 runs passed on 4.14, including 7 on the same Downstream-OCP4-15-VSPHERE6-UPI-ENCRYPTION-1AZ-RHCOS-VSAN-LSO-VMDK-3M-3W cluster. We have no history of passing runs on that cluster type on ODF 4.15: both test runs failed with the mgr pods in CLBO and the PVCs Pending.

@jijoy wdyt?

Comment 5 Daniel Osypenko 2024-02-19 08:16:50 UTC
@khiremat, 
> I didn't understand the testcase.

Please take a look at the test log; it has every command and CR applied during the test, with timestamps added.
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-061vue1cslv33-uba/j-061vue1cslv33-uba_20240103T120010/logs/ocs-ci-logs-1704292772/by_outcome/failed/tests/functional/pv/pvc_clone/test_clone_with_different_access_mode.py/TestCloneWithDifferentAccessMode/test_clone_with_different_access_mode/logs

Basically, the test clones the PVC but changes the "accessModes", for example ReadWriteMany -> ReadWriteOnce, and so on.
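
For illustration, checking the parent's mode versus the mode requested on the clone (names are hypothetical):

oc get pvc parent-pvc -n test-ns -o jsonpath='{.spec.accessModes}'   # e.g. ["ReadWriteMany"]
# the clone manifest (as in the description) then requests a different mode:
#   accessModes:
#   - ReadWriteOnce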

Comment 11 Malay Kumar parida 2024-02-22 13:33:34 UTC

*** This bug has been marked as a duplicate of bug 2258357 ***

