Bug 2295404
| Summary: | [MDR] virtualmachines.kubevirt.io resource fails restore due to mac allocation failure on relocate | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Parikshith <pbyregow> |
| Component: | odf-dr | Assignee: | Raghavendra Talur <rtalur> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | kbg, kseeger, muagarwa, rgowdege, rtalur |
| Version: | 4.16 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Known Issue |
| Doc Text: | `virtualmachines.kubevirt.io` resource fails restore due to MAC allocation failure on relocate. When a virtual machine is relocated to the preferred cluster, the relocation might fail to complete because the MAC address is unavailable. This happens if the virtual machine on the preferred cluster is not fully cleaned up after it is failed over to the failover cluster. Workaround: Ensure that the workload is completely removed from the preferred cluster before relocating it (see the verification sketch after this table). | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-08-19 13:07:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
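The workaround in the Doc Text field above depends on verifying that the failed-over workload has actually been removed from the preferred cluster before the relocate is triggered. The following is a minimal sketch of such a check, not part of the original report, assuming the namespace (`vm-pvc-disc1`) and PVC (`vm-1-pvc`) named in this bug; adjust the names for other workloads.

```sh
# Hedged sketch: run against the preferred cluster (c1) *before* initiating the
# relocate. The goal is to confirm the failed-over VM workload is fully removed,
# so no stale VM object or Terminating PVC is left behind.

NS=vm-pvc-disc1          # workload namespace from this report

# 1. The VirtualMachine/VirtualMachineInstance objects should be gone.
oc get vm,vmi -n "$NS"                  # expect "No resources found"

# 2. No PVC should be stuck in Terminating; wait for the old PVC to disappear.
oc get pvc -n "$NS"
oc wait --for=delete pvc/vm-1-pvc -n "$NS" --timeout=300s

# 3. Optionally confirm nothing else from the workload remains in the namespace.
oc get all -n "$NS"
```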
Description of problem (please be detailed as possible and provide log snippets):

kubeobjects are not getting restored on relocate. The apps stuck in WaitForReadiness were first failed over from c1 to c2 successfully and later relocated back to c1.

openshift-dr-ops   vm-dvt-disc1   24h   pbyregow-cl1   pbyregow-cl2   Relocate   Relocating   WaitForReadiness   2024-07-03T07:06:49Z   False
openshift-dr-ops   vm-pvc-disc1   24h   pbyregow-cl1   pbyregow-cl2   Relocate   Relocating   WaitForReadiness   2024-07-03T07:06:41Z   False

PVCs exist on both managed clusters, but in different states after the relocate:

c1:
oc get pvc -n vm-pvc-disc1
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
vm-1-pvc   Bound    pvc-6aec43fa-3939-4850-baf0-add775ee0d46   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 5h21m

c2:
oc get pvc -n vm-pvc-disc1
NAME       STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                           VOLUMEATTRIBUTESCLASS   AGE
vm-1-pvc   Terminating   pvc-6aec43fa-3939-4850-baf0-add775ee0d46   512Mi      RWX            ocs-external-storagecluster-ceph-rbd   <unset>                 6h36m

Ramen DR operator log on c1:

2024-07-03T08:25:34.302Z ERROR controllers.VolumeReplicationGroup.vrginstance controllers/vrg_kubeobjects.go:626 Kube objects group recover error {"VolumeReplicationGroup": {"name":"vm-pvc-disc1","namespace":"openshift-dr-ops"}, "rid": "b500f96f-2c16-4a5d-ae40-0585c905235d", "State": "primary", "number": 0, "profile": "s3profile-pbyregow-cl1-ocs-external-storagecluster", "group": 0, "name": "", "error": "restorePartiallyFailed"}
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecoveryStartOrResume
    /remote-source/app/controllers/vrg_kubeobjects.go:626
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecover
    /remote-source/app/controllers/vrg_kubeobjects.go:501
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsFromS3
    /remote-source/app/controllers/vrg_volrep.go:1899
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsForVolRep
    /remote-source/app/controllers/vrg_volrep.go:1837
github.com/ramendr/ramen/controllers.(*VRGInstance).clusterDataRestore
    /remote-source/app/controllers/volumereplicationgroup_controller.go:613
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
    /remote-source/app/controllers/volumereplicationgroup_controller.go:888
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
    /remote-source/app/controllers/volumereplicationgroup_controller.go:561
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
    /remote-source/app/controllers/volumereplicationgroup_controller.go:448
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:227

Velero log:

time="2024-07-03T10:17:00Z" level=error msg="No BackupStorageLocations found, at least one is required" backup-storage-location=openshift-adp/openshift-dr-ops--vm-pvc-disc1--0----s3profile-pbyregow-cl1-ocs-external-storagecluster controller=backup-storage-location error="no backup storage locations found" error.file="/remote-source/velero/app/internal/storage/storagelocation.go:91" error.function=github.com/vmware-tanzu/velero/internal/storage.ListBackupStorageLocations logSource="/remote-source/velero/app/pkg/controller/backup_storage_location_controller.go:88"
time="2024-07-03T10:17:00Z" level=info msg="plugin process exited" cmd=/plugins/velero-plugin-for-aws id=122246 logSource="/remote-source/velero/app/pkg/plugin/clientmgmt/process/logrus_adapter.go:80" plugin=/plugins/velero-plugin-for-aws
time="2024-07-03T10:17:00Z" level=error msg="restore not found" error="Restore.velero.io \"openshift-dr-ops--vm-pvc-disc1--0\" not found" logSource="/remote-source/velero/app/pkg/controller/restore_finalizer_controller.go:100" restore finalizer=openshift-adp/openshift-dr-ops--vm-pvc-disc1--0

Version of all relevant components (if applicable):
OCP: 4.16.0-0.nightly-2024-06-27-091410
ODF: 4.16.0-134
ACM: 2.11.0-137
OADP: 1.4 (latest)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Seen 1/3 times; not seen on an RDR cluster.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Created an MDR cluster with the versions listed above.
2. Created discovered and managed apps (busybox/vm) on both c1 and c2.
3. Fenced c1.
4. Failed over the apps to c2 (manually cleaned up the discovered apps on c1).
5. Unfenced and gracefully rebooted c1.
6. Relocated all apps back to c1 (manually cleaned up the discovered apps on c2).

Actual results:
A couple of discovered apps are stuck in WaitForReadiness, and kubeobjects are not getting restored for these apps.

Expected results:
All apps should be relocated successfully.

Additional info:
Apps which were deployed on c2 and relocated to c1 worked. The relocate progression and the Velero resources involved can be inspected as shown in the sketch below.
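Not part of the original report: a hedged sketch of the checks used to follow a relocate like the one above, assuming the `openshift-dr-ops` and `openshift-adp` namespaces seen in the logs and that the Ramen and OADP/Velero CRDs are installed.

```sh
# DRPC progression on the hub: should move past WaitForReadiness once the
# kube objects restore succeeds on the preferred cluster.
oc get drpc -n openshift-dr-ops

# VRG state on the managed cluster where the restore runs (c1 in this report);
# the error above was reported by the VolumeReplicationGroup in this namespace.
oc get vrg -n openshift-dr-ops

# Velero restore objects and backup storage locations in the OADP namespace.
# The "No BackupStorageLocations found" error suggests confirming that a BSL
# exists and is Available at the time the restore is attempted.
oc get restores.velero.io -n openshift-adp
oc get backupstoragelocations -n openshift-adp
```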