Bug 2219628 - [RDR] After workloads are deleted, VRG deletion remains stuck for several hours, rbd false image count is shown and ceph command hangs on secondary [NEEDINFO]
Summary: [RDR] After workloads are deleted, VRG deletion remains stuck for several hours, rbd false image count is shown and ceph command hangs on secondary
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: Multi-Cloud Object Gateway
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Romy Ayalon
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-04 16:08 UTC by Aman Agrawal
Modified: 2023-08-09 16:49 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
amagrawa: needinfo? (nojha)
amagrawa: needinfo? (akupczyk)
rayalon: needinfo? (srangana)
amagrawa: needinfo? (srangana)



Comment 6 kmanohar 2023-07-10 10:45:27 UTC
The same issue has been observed on the RDR Longevity setup.

OCP version: 4.13.0-0.nightly-2023-06-05-164816
ODF version: 4.13.0-219.snaptrim
Submariner version: v0.15.1
VolSync version: volsync-product.v0.7.1


The issue is visible in the VolumeReplication YAML:

vr yaml output
--------------

oc get vr busybox-pvc-61 -o yaml

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  creationTimestamp: "2023-07-10T08:04:25Z"
  finalizers:
  - replication.storage.openshift.io
  generation: 1
  name: busybox-pvc-61
  namespace: appset-busybox-4
  ownerReferences:
  - apiVersion: ramendr.openshift.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: VolumeReplicationGroup
    name: busybox-4-placement-drpc
    uid: 6f21ad83-16e0-4eb9-98bf-e43b9fb9bdf0
  resourceVersion: "36486402"
  uid: a85e701c-4109-49a5-9dd6-fcb682a818bf
spec:
  autoResync: false
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: busybox-pvc-61
  replicationHandle: ""
  replicationState: primary
  volumeReplicationClass: rbd-volumereplicationclass-2263283542
status:
  conditions:
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: FailedToPromote
    status: "False"
    type: Completed
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: Error
    status: "True"
    type: Degraded
  - lastTransitionTime: "2023-07-10T08:04:26Z"
    message: ""
    observedGeneration: 1
    reason: NotResyncing
    status: "False"
    type: Resyncing
  message: 'rados: ret=-11, Resource temporarily unavailable'
  observedGeneration: 1
  state: Unknown
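
The 'rados: ret=-11' in the status message is EAGAIN (Resource temporarily unavailable) returned while promoting the rbd image, which leaves the VR in state Unknown with reason FailedToPromote. To spot any other VolumeReplication resources stuck in the same state, something like the following can be run on the managed cluster (a suggested check, not part of the original report; it assumes the short name "vr" resolves to the VolumeReplication CRD):

oc get vr -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DESIRED:.spec.replicationState,CURRENT:.status.state

# Rows with DESIRED=primary but CURRENT=Unknown match the failure shown above.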

Must gather logs
----------------

c1 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/c1/

c2 - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/c2/

hub - http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2219628/july10/hub/

A live setup is available for debugging.
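
Since the bug title also reports a false rbd image count and ceph commands hanging on the secondary cluster, the mirror state can be cross-checked from the rook-ceph toolbox on that live setup. This is only a suggested sketch; it assumes the default ODF toolbox deployment name (rook-ceph-tools) and block pool name (ocs-storagecluster-cephblockpool):

oc rsh -n openshift-storage deploy/rook-ceph-tools

# Inside the toolbox shell:
ceph -s                                                   # overall cluster health; a hang here reproduces the reported symptom
rbd mirror pool status ocs-storagecluster-cephblockpool   # aggregate mirror health and mirrored image count
rbd ls -p ocs-storagecluster-cephblockpool | wc -l        # compare against the expected number of RBD images/PVCs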

Comment 10 Aman Agrawal 2023-07-10 17:36:53 UTC
(In reply to kmanohar from comment #6)
> Same issue has been observed on RDR Longevity setup
> [...]

This issue has been reported separately and is being tracked in BZ 2221716. @akupczyk, could you please update this BZ instead with your observations from the longevity setup that Pratik shared with you offline?

