Bug 2295404
| Summary: | [MDR] virtualmachines.kubevirt.io resource fails restore due to mac allocation failure on relocate | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Parikshith <pbyregow> |
| Component: | odf-dr | Assignee: | Raghavendra Talur <rtalur> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | kbg, kseeger, muagarwa, rgowdege, rtalur |
| Version: | 4.16 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | `virtualmachines.kubevirt.io` resource fails restore due to MAC allocation failure on relocate. When a virtual machine is relocated to the preferred cluster, the relocation might fail to complete because the MAC address is unavailable. This happens if the virtual machine on the preferred cluster is not fully cleaned up after it is failed over to the failover cluster. Workaround: Ensure that the workload is completely removed from the preferred cluster before relocating it (see the verification sketch after this table). | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-08-19 13:07:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
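The workaround in the Doc Text field above depends on verifying that the failed-over workload has actually been removed from the preferred cluster before the relocate is triggered. The following is a minimal sketch of such a check, not part of the original report, assuming the namespace (`vm-pvc-disc1`) and PVC (`vm-1-pvc`) named in this bug; adjust the names for other workloads.

```sh
# Hedged sketch: run against the preferred cluster (c1) *before* initiating the
# relocate. The goal is to confirm the failed-over VM workload is fully removed,
# so no stale VM object or Terminating PVC is left behind.

NS=vm-pvc-disc1          # workload namespace from this report

# 1. The VirtualMachine/VirtualMachineInstance objects should be gone.
oc get vm,vmi -n "$NS"                  # expect "No resources found"

# 2. No PVC should be stuck in Terminating; wait for the old PVC to disappear.
oc get pvc -n "$NS"
oc wait --for=delete pvc/vm-1-pvc -n "$NS" --timeout=300s

# 3. Optionally confirm nothing else from the workload remains in the namespace.
oc get all -n "$NS"
```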
Description of problem (please be detailed as possible and provide log snippets):

kubeobjects are not getting restored on relocate. The apps stuck in WaitForReadiness were first failed over from c1 to c2 successfully and later relocated back to c1.

openshift-dr-ops   vm-dvt-disc1   24h   pbyregow-cl1   pbyregow-cl2   Relocate   Relocating   WaitForReadiness   2024-07-03T07:06:49Z   False
openshift-dr-ops   vm-pvc-disc1   24h   pbyregow-cl1   pbyregow-cl2   Relocate   Relocating   WaitForReadiness   2024-07-03T07:06:41Z   False

PVCs exist on both managed clusters, but in different states after the relocate:

c1:
oc get pvc -n vm-pvc-disc1
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
vm-1-pvc   Bound    pvc-6aec43fa-3939-4850-baf0-add775ee0d46   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 5h21m

c2:
oc get pvc -n vm-pvc-disc1
NAME       STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
vm-1-pvc   Terminating   pvc-6aec43fa-3939-4850-baf0-add775ee0d46   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 6h36m

Ramen DR operator log on c1:

2024-07-03T08:25:34.302Z ERROR controllers.VolumeReplicationGroup.vrginstance controllers/vrg_kubeobjects.go:626 Kube objects group recover error {"VolumeReplicationGroup": {"name":"vm-pvc-disc1","namespace":"openshift-dr-ops"}, "rid": "b500f96f-2c16-4a5d-ae40-0585c905235d", "State": "primary", "number": 0, "profile": "s3profile-pbyregow-cl1-ocs-external-storagecluster", "group": 0, "name": "", "error": "restorePartiallyFailed"}
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecoveryStartOrResume
    /remote-source/app/controllers/vrg_kubeobjects.go:626
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecover
    /remote-source/app/controllers/vrg_kubeobjects.go:501
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsFromS3
    /remote-source/app/controllers/vrg_volrep.go:1899
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsForVolRep
    /remote-source/app/controllers/vrg_volrep.go:1837
github.com/ramendr/ramen/controllers.(*VRGInstance).clusterDataRestore
    /remote-source/app/controllers/volumereplicationgroup_controller.go:613
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
    /remote-source/app/controllers/volumereplicationgroup_controller.go:888
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
    /remote-source/app/controllers/volumereplicationgroup_controller.go:561
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
    /remote-source/app/controllers/volumereplicationgroup_controller.go:448
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:227

Velero log:

time="2024-07-03T10:17:00Z" level=error msg="No BackupStorageLocations found, at least one is required" backup-storage-location=openshift-adp/openshift-dr-ops--vm-pvc-disc1--0----s3profile-pbyregow-cl1-ocs-external-storagecluster controller=backup-storage-location error="no backup storage locations found" error.file="/remote-source/velero/app/internal/storage/storagelocation.go:91" error.function=github.com/vmware-tanzu/velero/internal/storage.ListBackupStorageLocations logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:88"
time="2024-07-03T10:17:00Z" level=info msg="plugin process exited" cmd=/plugins/velero-plugin-for-aws id=122246 logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-07-03T10:17:00Z" level=error msg="restore not found" error="Restore.velero.io \"openshift-dr-ops--vm-pvc-disc1--0\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_finalizer_controller.go:100" restore finalizer=openshift-adp/openshift-dr-ops--vm-pvc-disc1--0

Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-06-27-091410
ODF: 4.16.0-134
ACM: 2.11.0-137
OADP: 1.4 (latest)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Seen 1/3 times; not seen on an RDR cluster.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Created an MDR cluster with the versions listed above.
2. Created discovered and managed apps (busybox/vm) on both c1 and c2.
3. Fenced c1.
4. Failed over the apps to c2 (manually cleaned up the discovered apps on c1).
5. Unfenced and gracefully rebooted c1.
6. Relocated all apps back to c1 (manually cleaned up the discovered apps on c2).

Actual results:
A couple of discovered apps are stuck in WaitForReadiness, and kubeobjects are not getting restored for these apps.

Expected results:
All apps should be relocated successfully.

Additional info:
Apps which were deployed on c2 and relocated to c1 worked. The relocate progression and the Velero resources involved can be inspected as shown in the sketch below.
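Not part of the original report: a hedged sketch of the checks used to follow a relocate like the one above, assuming the `openshift-dr-ops` and `openshift-adp` namespaces seen in the logs and that the Ramen and OADP/Velero CRDs are installed.

```sh
# DRPC progression on the hub: should move past WaitForReadiness once the
# kube objects restore succeeds on the preferred cluster.
oc get drpc -n openshift-dr-ops

# VRG state on the managed cluster where the restore runs (c1 in this report);
# the error above was reported by the VolumeReplicationGroup in this namespace.
oc get vrg -n openshift-dr-ops

# Velero restore objects and backup storage locations in the OADP namespace.
# The "No BackupStorageLocations found" error suggests confirming that a BSL
# exists and is Available at the time the restore is attempted.
oc get restores.velero.io -n openshift-adp
oc get backupstoragelocations -n openshift-adp
```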