Bug 2104844

Summary: [RDR] Cleanup of primary cluster is stuck and never completes when relocate operation is performed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub Component: RBD-Mirror
Version: 4.10
Target Release: ODF 4.12.0
Fixed In Version: 4.11.0-137
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Aman Agrawal <amagrawa>
Assignee: Ilya Dryomov <idryomov>
QA Contact: Sidhant Agrawal <sagrawal>
CC: bmekhiss, bniver, idryomov, kramdoss, kseeger, madam, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, sagrawal, srangana, vashastr
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Clones: 2105454 2116493 (view as bug list)
Bug Depends On: 2105454
Last Closed: 2023-01-31 00:19:40 UTC

Comment 6 Benamar Mekhissi 2022-07-07 12:59:57 UTC
@mrajanna @idryomov I see the following. Any ideas???

`rbd mirror image status` on C2 is reporting an error:
```
ceph-tools-55b98f657d-h647k -- rbd mirror image status csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 --pool ocs-storagecluster-cephblockpool
csi-vol-89d68c37-fc45-11ec-824c-0a580a850227:
  global_id:   5334c11b-2d9c-47af-9c5c-f56fc31f4407
  state:       up+error
  description: incomplete local non-primary snapshot
  service:     a on dhcp161-177.lab.eng.blr.redhat.com
  last_update: 2022-07-07 12:55:18
  peer_sites:
    name: c93bfe26-f907-4492-9bb8-f6d93fdbe5a8
    state: up+stopped
    description: local image is primary
    last_update: 2022-07-07 12:55:31
```
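
If it helps, the "incomplete local non-primary snapshot" on C2 should be verifiable directly from the toolbox; a minimal sketch (pool and image names taken from the status above, not re-run on this cluster):
```
# List all snapshots (including mirror snapshots) for the affected image on C2;
# the non-primary mirror snapshot is expected to show up as incomplete.
rbd snap ls --all --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227

# Pool-wide mirroring health, including the peer site, for additional context.
rbd mirror pool status ocs-storagecluster-cephblockpool --verbose
```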

The VolumeReplication (VR) on C2 is reporting a failure to disable volume replication:
```
{"level":"error","timestamp":"2022-07-07T12:25:57.113Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:198","msg":"failed to disable volume replication","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","error":"rpc error: code = InvalidArgument desc = secondary image status is up=true and state=error"}
```
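
For completeness, the failing VolumeReplication CR on C2 can be inspected like this (name and namespace taken from the log above; a sketch assuming the usual csi-addons resource names):
```
# Dump the VolumeReplication CR that keeps failing to disable replication.
oc get volumereplication busybox-pvc-85 -n busybox-workloads-5 -o yaml

# Recent events/conditions for the same CR.
oc describe volumereplication busybox-pvc-85 -n busybox-workloads-5
```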

On C1, I don't see that we ever force-promoted; the promote attempt failed:
```
{"level":"info","timestamp":"2022-07-05T09:36:39.589Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:191","msg":"adding finalizer to PersistentVolumeClaim object","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","Finalizer":"replication.storage.openshift.io/pvc-protection"}
{"level":"error","timestamp":"2022-07-05T09:36:40.618Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:248","msg":"failed to promote volume","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
{"level":"error","timestamp":"2022-07-05T09:36:40.618Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"failed to Replicate","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","ReplicationState":"primary","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
{"level":"error","timestamp":"2022-07-05T09:36:40.624Z","logger":"controller-runtime.manager.controller.volumereplication","caller":"controller/controller.go:253","msg":"Reconciler error","reconciler group":"replication.storage.openshift.io","reconciler kind":"VolumeReplication","name":"busybox-pvc-85","namespace":"busybox-workloads-5","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
```
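
For reference only (not suggesting it was run here), a manual force-promote of that image from the toolbox on C1 would look roughly like this, assuming the standard rbd mirroring CLI:
```
# Force-promote the image on C1 even if the peer cannot be demoted cleanly.
# This risks split-brain; the other site usually needs a resync afterwards.
rbd mirror image promote --force --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227

# Once the sites reconnect, the stale copy would typically need:
rbd mirror image resync --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227
```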

Comment 33 Mudit Agarwal 2022-08-11 04:58:04 UTC
Please provide doc text.

Comment 38 Mudit Agarwal 2022-08-17 13:17:31 UTC
Karthick, this means that we need to move it back to 4.11.0 and mark it ON_QA. Please confirm.

Comment 39 krishnaram Karthick 2022-08-18 14:37:28 UTC
(In reply to Mudit Agarwal from comment #38)
> Karthick, this means that we need to move it back to 4.11.0 and mark it
> ON_QA. Please confirm.

Yes, you are correct. Could you please move it to ON_QA?

Comment 60 errata-xmlrpc 2023-01-31 00:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

Comment 61 Red Hat Bugzilla 2023-12-08 04:29:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days