Bug 2104844

Summary: [RDR] Cleanup of primary cluster is stuck and never completes when relocate operation is performed
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
Sub Component: RBD-Mirror
Version: 4.10
Target Release: ODF 4.12.0
Fixed In Version: 4.11.0-137
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Aman Agrawal <amagrawa>
Assignee: Ilya Dryomov <idryomov>
QA Contact: Sidhant Agrawal <sagrawal>
CC: bmekhiss, bniver, idryomov, kramdoss, kseeger, madam, mrajanna, muagarwa, ocs-bugs, odf-bz-bot, sagrawal, srangana, vashastr
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Clones: 2105454 2116493 (view as bug list)
Bug Depends On: 2105454
Last Closed: 2023-01-31 00:19:40 UTC

Comment 6 Benamar Mekhissi 2022-07-07 12:59:57 UTC
@mrajanna @idryomov I see the following. Any ideas???

`rbd mirror image status` on C2 is reporting an error:
```
ceph-tools-55b98f657d-h647k -- rbd mirror image status csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 --pool ocs-storagecluster-cephblockpool
csi-vol-89d68c37-fc45-11ec-824c-0a580a850227:
  global_id:   5334c11b-2d9c-47af-9c5c-f56fc31f4407
  state:       up+error
  description: incomplete local non-primary snapshot
  service:     a on dhcp161-177.lab.eng.blr.redhat.com
  last_update: 2022-07-07 12:55:18
  peer_sites:
    name: c93bfe26-f907-4492-9bb8-f6d93fdbe5a8
    state: up+stopped
    description: local image is primary
    last_update: 2022-07-07 12:55:31
```
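
If it helps, the "incomplete local non-primary snapshot" on C2 should be verifiable directly from the toolbox; a minimal sketch (pool and image names taken from the status above, not re-run on this cluster):
```
# List all snapshots (including mirror snapshots) for the affected image on C2;
# the non-primary mirror snapshot is expected to show up as incomplete.
rbd snap ls --all --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227

# Pool-wide mirroring health, including the peer site, for additional context.
rbd mirror pool status ocs-storagecluster-cephblockpool --verbose
```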

The VolumeReplication (VR) on C2 is reporting a failure to disable volume replication:
```
{"level":"error","timestamp":"2022-07-07T12:25:57.113Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:198","msg":"failed to disable volume replication","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","error":"rpc error: code = InvalidArgument desc = secondary image status is up=true and state=error"}
```
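
For completeness, the failing VolumeReplication CR on C2 can be inspected like this (name and namespace taken from the log above; a sketch assuming the usual csi-addons resource names):
```
# Dump the VolumeReplication CR that keeps failing to disable replication.
oc get volumereplication busybox-pvc-85 -n busybox-workloads-5 -o yaml

# Recent events/conditions for the same CR.
oc describe volumereplication busybox-pvc-85 -n busybox-workloads-5
```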

On C1, I don't see that we ever force-promoted; the promote attempt failed:
```
{"level":"info","timestamp":"2022-07-05T09:36:39.589Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:191","msg":"adding finalizer to PersistentVolumeClaim object","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","Finalizer":"replication.storage.openshift.io/pvc-protection"}
{"level":"error","timestamp":"2022-07-05T09:36:40.618Z","logger":"controllers.VolumeReplication","caller":"controllers/volumereplication_controller.go:248","msg":"failed to promote volume","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
{"level":"error","timestamp":"2022-07-05T09:36:40.618Z","logger":"controllers.VolumeReplication","caller":"controller/controller.go:298","msg":"failed to Replicate","Request.Name":"busybox-pvc-85","Request.Namespace":"busybox-workloads-5","ReplicationState":"primary","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
{"level":"error","timestamp":"2022-07-05T09:36:40.624Z","logger":"controller-runtime.manager.controller.volumereplication","caller":"controller/controller.go:253","msg":"Reconciler error","reconciler group":"replication.storage.openshift.io","reconciler kind":"VolumeReplication","name":"busybox-pvc-85","namespace":"busybox-workloads-5","error":"rpc error: code = Internal desc = ocs-storagecluster-cephblockpool/csi-vol-89d68c37-fc45-11ec-824c-0a580a850227 mirrored image is not healthy. State is up=false, state=\"unknown\""}
```
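
For reference only (not suggesting it was run here), a manual force-promote of that image from the toolbox on C1 would look roughly like this, assuming the standard rbd mirroring CLI:
```
# Force-promote the image on C1 even if the peer cannot be demoted cleanly.
# This risks split-brain; the other site usually needs a resync afterwards.
rbd mirror image promote --force --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227

# Once the sites reconnect, the stale copy would typically need:
rbd mirror image resync --pool ocs-storagecluster-cephblockpool csi-vol-89d68c37-fc45-11ec-824c-0a580a850227
```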

Comment 33 Mudit Agarwal 2022-08-11 04:58:04 UTC
Please provide doc text.

Comment 38 Mudit Agarwal 2022-08-17 13:17:31 UTC
Karthick, this means that we need to move it back to 4.11.0 and mark it ON_QA. Please confirm.

Comment 39 krishnaram Karthick 2022-08-18 14:37:28 UTC
(In reply to Mudit Agarwal from comment #38)
> Karthick, this means that we need to move it back to 4.11.0 and mark it
> ON_QA. Please confirm.

Yes, you are correct. Could you please move it to ON_QA?

Comment 60 errata-xmlrpc 2023-01-31 00:19:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Data Foundation 4.12.0 enhancement and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:0551

Comment 61 Red Hat Bugzilla 2023-12-08 04:29:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days