Bug 2291343 - [MDR][Tracker ACM-12228] Subscription application with VM is not failing over
Summary: [MDR][Tracker ACM-12228] Subscription application with VM is not failing over
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nir Soffer
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-06-11 15:48 UTC by Kevin Alon Goldblatt
Modified: 2024-09-18 10:22 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
Red Hat Issue Tracker ACM-12228 (last updated 2024-06-24 08:30:52 UTC)

Description Kevin Alon Goldblatt 2024-06-11 15:48:02 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
A Subscription application with a VM referencing a standalone DataVolume does not fail over the DataVolume, Secret, or VM. Only the PVC is failed over.
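
For reference, a minimal sketch of the kind of workload involved, using placeholder names (vm-dv, busybox-sample, the disk source, and the storage class are assumptions for illustration, not the actual test manifests): a VirtualMachine that references a standalone DataVolume by name instead of embedding it as a dataVolumeTemplate.

# Minimal sketch with placeholder names; not the actual test manifests.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: vm-dv                    # standalone DataVolume, created independently of the VM
  namespace: busybox-sample      # placeholder application namespace
spec:
  source:
    registry:
      url: docker://quay.io/containerdisks/fedora:latest   # placeholder disk source
  storage:
    resources:
      requests:
        storage: 30Gi
    storageClassName: ocs-storagecluster-ceph-rbd          # placeholder storage class
---
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-dv
  namespace: busybox-sample
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 2Gi
      volumes:
        - name: rootdisk
          dataVolume:
            name: vm-dv          # references the standalone DataVolume above, not a dataVolumeTemplate

The distinction matters here because a DataVolume created from a dataVolumeTemplate is owned by the VM, while a standalone DataVolume is a separate object that has to be failed over in its own right.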

Version of all relevant components (if applicable):
oc get csv -n openshift-cnv
NAME                                       DISPLAY                         VERSION            REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.16.0   OpenShift Virtualization        4.16.0             kubevirt-hyperconverged-operator.v4.15.2   Succeeded
odr-cluster-operator.v4.16.0-90.stable     Openshift DR Cluster Operator   4.16.0-90.stable                                              Succeeded
openshift-gitops-operator.v1.12.1          Red Hat OpenShift GitOps        1.12.1             openshift-gitops-operator.v1.12.0          Succeeded
recipe.v4.16.0-90.stable                   Recipe                          4.16.0-90.stable                                              Failed
volsync-product.v0.9.1                     VolSync                         0.9.1              volsync-product.v0.9.0                     Succeeded
[cloud-user@ocp-psi-executor-xl vm16-pull-app]$ oc version
Client Version: 4.16.0-ec.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.0-ec.5
Kubernetes Version: v1.29.2+258f1d5

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Subscription application VMs with standalone DataVolumes are not failing over

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Can this issue reproducible?
Yes

Can this issue reproduce from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Created the Subscription app with a VM using a standalone DataVolume - VM deployed on the primary cluster
2. Applied the DR Policy
3. Accessed the VM on the primary cluster, wrote 2 text files, and ran 'tail -10 /var/log/ramen.log'
4. Fenced the primary cluster - the VM is paused on the primary cluster
5. Failed over the subscription application >>>> The VM is not failed over to the secondary cluster and remains in a paused state on the primary cluster.
6. Checked the secondary cluster - the namespace was created and the PVC was failed over, but the public key Secret, the DataVolume, and the VM were not failed over
7. Also noted that the cdi.kubevirt.io/allowClaimAdoption: "true" annotation was not on the failed-over PVC (see the verification sketch after this list)
Had successfully failed over and relocated an AppSet pull VM app using a standalone DataVolume before this.
Failing over DataVolume templates worked fine.
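
A minimal sketch of how the secondary-cluster state and the adoption annotation can be checked, assuming the placeholder namespace and PVC names from the sketch above (not the actual test app):

# Placeholder names; adjust to the real application namespace and PVC.
# List the workload resources that should exist after failover.
oc get vm,vmi,dv,pvc,secret -n busybox-sample

# Check whether the failed-over PVC carries the CDI adoption annotation.
oc get pvc vm-dv -n busybox-sample \
  -o jsonpath='{.metadata.annotations.cdi\.kubevirt\.io/allowClaimAdoption}'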


Actual results:
The namespace is created on the secondary cluster during failover and only the PVC is failed over.

Expected results:
The PVC, Secret, DataVolume, and VM should fail over to the secondary cluster

Additional info:

Comment 9 Kevin Alon Goldblatt 2024-06-17 12:26:22 UTC
Update on failed subscription apps on Metro DR environment!

Hi, I have done some additional troubleshooting with manual testing around the failed Subscription applications on the Metro DR environment to understand the scope and try to get to the root cause:

I ran all variants of the subscription app:
In all cases the drpc status is Failed over and stuck on Cleaning up.
In all cases the namespace is created on the secondary 'c2' cluster and only contains the PVC.
The VM, pod, and/or DataVolume are not failed over.
[1] Subscription app VM using a PVC - failed failover
[2] Subscription app VM using a standalone DataVolume - failed failover (https://bugzilla.redhat.com/show_bug.cgi?id=2291343)
[3] Subscription app using a DataVolume template - failed failover (Note: this specific test passed a few weeks ago on this same environment!)
[4] Application set push app with PVC - passed failover and relocate!
[5] Application set push app with standalone DataVolume - passed failover and relocate a few weeks ago - will retest again now
[6] Application set pull app with DataVolume template - passed failover and relocate a few weeks ago - will retest again now

I used the scenario below in each test:
Created the Subscription pvc/dv/dvt application - deployed successfully on the primary cluster
Accessed the VM and wrote a text file and ran 'tail -10 /var/log/ramen.log'
Enrolled the Subscription application to the DRPolicy
Fenced the primary cluster
Failed over the Subscription application >>>>> Only the PVC is created; no pods and no VM are created (DRPC state checked as sketched below)
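
A sketch of how the DRPC state can be inspected on the hub; busybox-sample is a placeholder for the application namespace in which the DRPlacementControl was created when the app was enrolled:

# Placeholder namespace; run against the hub cluster.
# Shows the failover phase and progression (e.g. stuck on cleanup).
oc get drpc -n busybox-sample -o wide

# Full status, including conditions and per-cluster protection state.
oc get drpc -n busybox-sample -o yaml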

Conclusion:
[a] It seems we have a specific issue with failing over Subscription applications on our MDR environment.
[b] Subscription applications with DataVolume templates passed a few weeks ago on this same environment - retested and now they are failing!
[c] Application set push apps are passing failover and relocate.
[d] So it seems that something has broken on our MDR environment!
[e] Could this be another issue with the S3/secrets configuration being deleted when the operators are upgraded, as we saw on the Regional DR environment? (A sketch of that configuration check follows below.)
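
A sketch of the kind of check that could confirm or rule out [e]; the config map names come from the Ramen operators, but the namespaces (openshift-operators on the hub, openshift-dr-system on the managed clusters) are assumptions about this MDR setup:

# Assumed namespaces; adjust to where the Ramen operators run in this environment.
# On the hub: confirm the hub operator config (s3StoreProfiles) and the secrets it references still exist.
oc get configmap ramen-hub-operator-config -n openshift-operators -o yaml
oc get secrets -n openshift-operators

# On each managed cluster: confirm the DR cluster operator config and its secrets are intact.
oc get configmap ramen-dr-cluster-operator-config -n openshift-dr-system -o yaml
oc get secrets -n openshift-dr-system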

Comment 17 Sunil Kumar Acharya 2024-06-25 12:09:21 UTC
Please update the RDT flag/text appropriately.

