Bug 2214306 - [MDR][RDR]: Application failover hangs in "FailingOver" state when the managed clusters are on different versions of OCP and ODF.
Summary: [MDR][RDR]: Application failover hangs in "FailingOver" state when the m...
Keywords:
Status: CLOSED DUPLICATE of bug 2215462
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra Talur
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks: 2154341
 
Reported: 2023-06-12 14:54 UTC by akarsha
Modified: 2023-08-09 17:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
The Disaster Recovery solution with OpenShift Data Foundation 4.13 protects and restores Persistent Volume Claim (PVC) data in addition to the Persistent Volume (PV) data. If the primary cluster is on an older OpenShift Data Foundation version and the target cluster is updated to 4.13, then the failover will be stuck because the S3 store will not have the PVC data. When upgrading the Disaster Recovery clusters, the primary cluster has to be upgraded first, and then the post-upgrade steps must be run.
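For illustration only (the command below appears later in this report; the interpretation of its columns is an assumption), the hub cluster can be queried before an upgrade to see which managed cluster each protected workload currently uses as its primary, so that that side is upgraded first as described above:

$ oc get drpc -A -owide    # run on the hub; columns include the preferred (primary) cluster, failover cluster, current action, and current state for every protected workload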
Clone Of:
Environment:
Last Closed: 2023-08-08 15:56:26 UTC
Embargoed:



Description akarsha 2023-06-12 14:54:32 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Upgrade one of the managed clusters, say c1 (both OCP and ODF), then perform failover of applications from c2 to c1; that is, perform failover of an application while the managed clusters are on different versions of OCP and ODF.
Here the c2 managed cluster is running OCP 4.12 (nightly build) and ODF 4.12.3, while c1 has been upgraded to OCP 4.13 (nightly build) and ODF 4.13 (latest RC).

When trying to fail over the applications (helloworld-c2, cronjob-c2, bs-1) from c2 to c1, the failover hangs in the "FailingOver" state with the error below:

 message: "Failed to restore PVs (failed to restore ClusterData for VolRep (failed
      to restore PVs and PVCs using profile list ([s3profile-akrai-c1-ocs-external-storagecluster
      s3profile-akrai-c2-ocs-external-storagecluster]): unable to ListKeys of type
      v1.PersistentVolume keyPrefix helloworld-c2/helloworld-c2-placement-1-drpc/v1.PersistentVolume/,
      failed to list objects in bucket odrbucket-67670dd10b7c:helloworld-c2/helloworld-c2-placement-1-drpc/v1.PersistentVolume/,
      InternalError: We encountered an internal error. Please try again.\n\tstatus
      code: 500, request id: lisp29p4-4gcur5-6o1, host id: lisp29p4-4gcur5-6o1))"
    observedGeneration: 1
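As a purely illustrative debugging step (not part of the original report), the failing list call could be reproduced directly against the metadata bucket with the AWS CLI; the S3 endpoint and credentials are placeholders that would have to be taken from the odrbucket object bucket claim and its secret on the cluster:

$ aws s3api list-objects-v2 --bucket odrbucket-67670dd10b7c \
    --prefix helloworld-c2/helloworld-c2-placement-1-drpc/v1.PersistentVolume/ \
    --endpoint-url https://<s3-endpoint>    # <s3-endpoint> is a placeholder, not a value from this report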
    
$ date; date --utc; oc get drpc -A -owide | grep -i FailingOver
Monday 12 June 2023 04:05:53 PM IST
Monday 12 June 2023 10:35:53 AM UTC
bs-1            bs-1-placement-1-drpc            4h43m   akrai-c2           akrai-c1          Failover       FailingOver    WaitingForPVRestore   2023-06-12T10:00:40Z                     False
cronjob-c2      cronjob-c2-placement-1-drpc      4h43m   akrai-c2           akrai-c1          Failover       FailingOver    WaitingForPVRestore   2023-06-12T10:00:52Z                     False
helloworld-c2   helloworld-c2-placement-1-drpc   4h42m   akrai-c2           akrai-c1          Failover       FailingOver    WaitingForPVRestore   2023-06-12T10:01:05Z                     False
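For reference, the full error condition quoted above can be read from an individual DRPC resource on the hub (namespace and name taken from the listing above), for example:

$ oc get drpc helloworld-c2-placement-1-drpc -n helloworld-c2 -o yaml    # status.conditions carries the complete FailingOver error message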

Version of all relevant components (if applicable):

c1 managed cluster (upgraded) versions:
OCP: 4.13.0-0.nightly-2023-06-09-152551
ODF: 4.13.0-rhodf (latest rc build)

c2 and hub clusters:
OCP: 4.12.0-0.nightly-2023-06-08-063126
ODF: 4.12.3-rhodf

ACM: 2.7.4
CEPH: 17.2.6-70.el9cp (fe62dcdbb2c6e05782a3e2b67d025b84ff5047cc) quincy (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes, application failover cannot be performed.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create 4 OCP clusters: 2 hub clusters and 2 managed clusters, plus one stretched RHCS cluster.
   Deploy the clusters across the zones as follows:
	zone a: arbiter ceph node
	zone b: c1, h1, 3 ceph nodes
	zone c: c2, h2, 3 ceph nodes
   Clusters deployed with versions:
   OCP : 4.12.0-0.nightly-2023-06-08-063126
   ODF: 4.12.3-rhodf
2. Configure MDR and deploy 10 applications on each managed cluster.
3. Upgrade the c1 managed cluster: OCP to 4.13.0-0.nightly-2023-06-09-152551 and ODF to 4.13.0-rhodf (latest RC build). The resulting version skew can be confirmed as shown in the sketch after these steps.
4. Perform failover and failback of applications from c1 to c2; this succeeds.
5. Perform failover of applications from c2 to c1; this hangs in the "FailingOver" state.
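For illustration (these commands are not in the original report), the OCP and ODF version skew introduced in step 3 can be confirmed on each managed cluster before attempting the failover in step 5; openshift-storage is assumed to be the ODF install namespace:

$ oc get clusterversion               # installed OCP version on the managed cluster
$ oc get csv -n openshift-storage     # installed ODF operator versions on the managed cluster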


Actual results:
Application failover hangs in the "FailingOver" state when the managed clusters are on different versions of OCP and ODF.

Expected results:
Failover and failback of applications should succeed.


Additional info:

Comment 4 Elad 2023-06-15 09:04:55 UTC
Proposing as a blocker for 4.13.0 due to the recent finding that Metro DR functionalities, such as failover and failback, are broken post upgrade to 4.13.0.

Comment 5 Harish NV Rao 2023-06-16 06:12:48 UTC
Retaining the original severity and removing the blocker flag, as the issue from comment 3 is now tracked in the new BZ https://bugzilla.redhat.com/show_bug.cgi?id=2215462, which is a blocker for 4.13.0.

We would like to retain this BZ for the original issue: "Application failover hangs in 'FailingOver' state when the managed clusters are on different versions of OCP and ODF".

