Bug 2219628

Summary: [RDR] After workloads are deleted, VRG deletion remains stuck for several hours, an incorrect rbd image count is shown, and ceph commands hang on the secondary
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Aman Agrawal <amagrawa>
Component: Multi-Cloud Object Gateway
Assignee: Romy Ayalon <rayalon>
Status: ASSIGNED
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Priority: unspecified
Version: 4.13
CC: akupczyk, bniver, kmanohar, muagarwa, nbecker, nojha, odf-bz-bot, prsurve, rayalon, sostapov, srangana
Target Milestone: ---
Flags: amagrawa: needinfo? (nojha)
       amagrawa: needinfo? (akupczyk)
       rayalon: needinfo? (srangana)
       amagrawa: needinfo? (srangana)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug

Comment 6 kmanohar 2023-07-10 10:45:27 UTC
The same issue has been observed on an RDR Longevity setup.

OCP version - 4.13.0-0.nightly-2023-06-05-164816
ODF version - 4.13.0-219.snaptrim
Submariner version - v0.15.1
VolSync version - volsync-product.v0.7.1


The issue is visible in the VolumeReplication YAML (a namespace-wide status check is sketched after the output below).

VR YAML output
--------------

oc get vr busybox-pvc-61 -o yaml

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  creationTimestamp: "2023-07-10T08:04:25Z"
  finalizers:
  - replication.storage.openshift.io
  generation: 1
  name: busybox-pvc-61
  namespace: appset-busybox-4
  ownerReferences:
  - apiVersion: ramendr.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: VolumeReplicationGroup
    name: busybox-4-placement-drpc
    uid: 6f21ad83-16e0-4eb9-98bf-e43b9fb9bdf0
  resourceVersion: "36486402"
  uid: a85e701c-4109-49a5-9dd6-fcb682a818bf
spec:
  autoResync: false
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: busybox-pvc-61
  replicationHandle: ""
  replicationState: primary
  volumeReplicationClass: rbd-volumereplicationclass-2263283542
status:
  conditions:
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: FailedToPromote
    status: "False"
    type: Completed
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: Error
    status: "True"
    type: Degraded
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: NotResyncing
    status: "False"
    type: Resyncing
  message: 'rados: ret=-11, Resource temporarily unavailable'
  observedGeneration: 1
  state: Unknown

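The VR above is stuck with state Unknown, a Completed condition that failed with reason FailedToPromote, and the error 'rados: ret=-11, Resource temporarily unavailable' (-11 is EAGAIN, i.e. the promote call could not get through). A minimal sketch, not taken from the must-gather logs below, of how the remaining VRs and the owning VRG could be checked in one pass; the namespace and VRG name are copied from the YAML above, and the resource short names are assumptions:

# List every VolumeReplication in the affected namespace with its state and
# last error, to see whether FailedToPromote is limited to busybox-pvc-61:
oc get volumereplication -n appset-busybox-4 \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,MESSAGE:.status.message

# Reason of the Completed condition for a single VR:
oc get vr busybox-pvc-61 -n appset-busybox-4 \
  -o jsonpath='{.status.conditions[?(@.type=="Completed")].reason}{"\n"}'

# Check whether the owning VRG is stuck in deletion (deletionTimestamp set
# while finalizers are still present); use the full resource name if the
# singular form does not resolve on this cluster:
oc get volumereplicationgroup busybox-4-placement-drpc -n appset-busybox-4 \
  -o jsonpath='{.metadata.deletionTimestamp}{" "}{.metadata.finalizers}{"\n"}'
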
Must gather logs
----------------

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/hub/

A live setup is available for debugging.
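
For the other two symptoms in the summary (the incorrect rbd image count and ceph commands hanging on the secondary), a sketch of what could be run on the live setup from the rook-ceph toolbox on the secondary cluster (c2); the toolbox pod label and pool name below are the ODF defaults and are assumptions here:

# Locate the toolbox pod (assumes the rook-ceph-tools deployment is enabled):
TOOLS=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name | head -n 1)

# Count the rbd images left in the block pool and compare with the number of
# PVC-backed images expected after the workloads were deleted:
oc -n openshift-storage exec "$TOOLS" -- rbd ls -p ocs-storagecluster-cephblockpool | wc -l

# Wrap the status commands in a timeout so the reported hang shows up as a
# non-zero exit instead of blocking the shell:
oc -n openshift-storage exec "$TOOLS" -- timeout 60 ceph -s
oc -n openshift-storage exec "$TOOLS" -- timeout 60 rbd mirror pool status ocs-storagecluster-cephblockpool --verbose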

Comment 10 Aman Agrawal 2023-07-10 17:36:53 UTC
(In reply to kmanohar from comment #6)

This issue has been reported separately and is being tracked by BZ 2221716. @akupczyk, could you please update this BZ instead with your observations from the longevity setup that Pratik shared with you offline?