Bug 2282284 - [RDR][Discovered Apps] ramen proceeds to recover kubeObjects from a capture even if the capture is invalid
Summary: [RDR][Discovered Apps] ramen proceeds to recover kubeObjects from a capture even if the capture is invalid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: odf-dr
Version: 4.16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.16.0
Assignee: Nir Soffer
QA Contact: Pratik Surve
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-05-22 04:34 UTC by Pratik Surve
Modified: 2024-07-17 13:23 UTC
CC List: 4 users

Fixed In Version: 4.16.0-110
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-07-17 13:23:33 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github RamenDR ramen pull 1414 0 None Merged Don't cleanup nil request 2024-05-23 11:27:12 UTC
Github red-hat-storage ramen pull 274 0 None open Bug 2282284: Don't cleanup nil request 2024-05-23 11:28:18 UTC
Github red-hat-storage ramen pull 275 0 None open Bug 2282284: Help pylint with drenv.commands 2024-05-23 14:46:48 UTC
Red Hat Product Errata RHSA-2024:4591 0 None None None 2024-07-17 13:23:35 UTC

Description Pratik Surve 2024-05-22 04:34:18 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR][Discovered Apps] ramen proceeds to recover kubeObjects from a capture even if the capture is invalid

Version of all relevant components (if applicable):

OCP version:- 4.16.0-0.nightly-2024-05-19-083311
ODF version:- 4.16.0-102
CEPH version:- ceph version 18.2.1-167.el9cp (e8c836edb24adb7717a6c8ba1e93a07e3efede29) reef (stable)
ACM version:- 2.11.0-86
SUBMARINER version:- v0.18.0
VOLSYNC version:- volsync-product.v0.9.0
VOLSYNC method:- destinationCopyMethod: Direct

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to reproduce:
1. Deploy an RDR cluster
2. Deploy a discovered app workload and DR-protect it
3. Perform a failover operation


Actual results:
2024-05-21T14:13:20.185Z	INFO	controllers.VolumeReplicationGroup.vrginstance	velero/requests.go:657	Backup	{"VolumeReplicationGroup": {"name":"busybox-disc-rbd-1","namespace":"openshift-dr-ops"}, "rid": "17ae1901-a8fc-4b74-8d5b-411ae8741507", "State": "primary", "phase": "FailedValidation", "warnings": 0, "errors": 0, "failure": "", "validation errors": ["an existing backup storage location wasn't specified at backup creation time and the default 'openshift-dr-ops--busybox-disc-rbd-1--1----s3profile-prsurve-c1-ocs-storagecluster' wasn't found. Please address this issue (see `velero backup-location -h` for options) and create a new backup. Error: BackupStorageLocation.velero.io \"openshift-dr-ops--busybox-disc-rbd-1--1----s3profile-prsurve-c1-ocs-storagecluster\" not found"]}
2024-05-21T14:13:20.185Z	ERROR	controllers.VolumeReplicationGroup.vrginstance	controllers/vrg_kubeobjects.go:611	Kube objects group recover error	{"VolumeReplicationGroup": {"name":"busybox-disc-rbd-1","namespace":"openshift-dr-ops"}, "rid": "17ae1901-a8fc-4b74-8d5b-411ae8741507", "State": "primary", "number": 1, "profile": "s3profile-prsurve-c1-ocs-storagecluster", "group": 0, "name": "", "error": "backupFailedValidation"}
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecoveryStartOrResume
	/remote-source/app/controllers/vrg_kubeobjects.go:611
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecover
	/remote-source/app/controllers/vrg_kubeobjects.go:494
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsFromS3
	/remote-source/app/controllers/vrg_volrep.go:1899
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsForVolRep
	/remote-source/app/controllers/vrg_volrep.go:1837
github.com/ramendr/ramen/controllers.(*VRGInstance).clusterDataRestore
	/remote-source/app/controllers/volumereplicationgroup_controller.go:603
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary
	/remote-source/app/controllers/volumereplicationgroup_controller.go:872
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG
	/remote-source/app/controllers/volumereplicationgroup_controller.go:551
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile
	/remote-source/app/controllers/volumereplicationgroup_controller.go:438
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:227
2024-05-21T14:13:20.185Z	INFO	controllers.VolumeReplicationGroup.vrginstance	runtime/panic.go:914	Exiting processing VolumeReplicationGroup	{"VolumeReplicationGroup": {"name":"busybox-disc-rbd-1","namespace":"openshift-dr-ops"}, "rid": "17ae1901-a8fc-4b74-8d5b-411ae8741507", "State": "primary"}
2024-05-21T14:13:20.185Z	INFO	controllers.VolumeReplicationGroup	runtime/panic.go:914	Exiting reconcile loop	{"VolumeReplicationGroup": {"name":"busybox-disc-rbd-1","namespace":"openshift-dr-ops"}, "rid": "17ae1901-a8fc-4b74-8d5b-411ae8741507"}
2024-05-21T14:13:20.185Z	INFO	controller/controller.go:115	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "volumereplicationgroup", "controllerGroup": "ramendr.openshift.io", "controllerKind": "VolumeReplicationGroup", "VolumeReplicationGroup": {"name":"busybox-disc-rbd-1","namespace":"openshift-dr-ops"}, "namespace": "openshift-dr-ops", "name": "busybox-disc-rbd-1", "reconcileID": "22361a5c-d9e4-4b96-96fe-505f3fcdfa75"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x19b708d]

goroutine 405 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x1bc6c80?, 0x3313c30?})
	/usr/lib/golang/src/runtime/panic.go:914 +0x21f
github.com/ramendr/ramen/controllers.(*VRGInstance).getRecoverOrProtectRequest.func4({0x0, 0x0})
	/remote-source/app/controllers/vrg_kubeobjects.go:561 +0x8d
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecoveryStartOrResume(0xc0006e63c0, 0xc0006e6580, {{0x23de158, 0xc00392f260}, {{0xc003c13a10, 0x27}, {0xc003af1c50, 0x16}, {0xc003ba1fc0, 0x3a}, ...}}, ...)
	/remote-source/app/controllers/vrg_kubeobjects.go:612 +0x72d
github.com/ramendr/ramen/controllers.(*VRGInstance).kubeObjectsRecover(0xc0006e63c0, 0x54a?, {{0xc003c13a10, 0x27}, {0xc003af1c50, 0x16}, {0xc003ba1fc0, 0x3a}, {0xc003c681a0, 0x6}, ...}, ...)
	/remote-source/app/controllers/vrg_kubeobjects.go:494 +0x5dc
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsFromS3(0xc0006e63c0, 0xc0006e6580)
	/remote-source/app/controllers/vrg_volrep.go:1899 +0x638
github.com/ramendr/ramen/controllers.(*VRGInstance).restorePVsAndPVCsForVolRep(0xc0006e63c0, 0xc0006e6580)
	/remote-source/app/controllers/vrg_volrep.go:1837 +0x10e
github.com/ramendr/ramen/controllers.(*VRGInstance).clusterDataRestore(0xc0006e63c0, 0xc003bae930?)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:603 +0x130
github.com/ramendr/ramen/controllers.(*VRGInstance).processAsPrimary(0xc0006e63c0)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:872 +0x105
github.com/ramendr/ramen/controllers.(*VRGInstance).processVRG(0xc0006e63c0)
	/remote-source/app/controllers/volumereplicationgroup_controller.go:551 +0x630
github.com/ramendr/ramen/controllers.(*VolumeReplicationGroupReconciler).Reconcile(0xc0006d3500, {0x23db4a0?, 0xc000cb01e0}, {{{0xc000b014b0, 0x10}, {0xc000b1e780, 0x12}}})
	/remote-source/app/controllers/volumereplicationgroup_controller.go:438 +0xae5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x23dedb8?, {0x23db4a0?, 0xc000cb01e0?}, {{{0xc000b014b0?, 0xb?}, {0xc000b1e780?, 0x0?}}})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0000c0a00, {0x23db4d8, 0xc0007cc410}, {0x1c72840?, 0xc001704d20?})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0000c0a00, {0x23db4d8, 0xc0007cc410})
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:266 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 90
	/remote-source/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime.3/pkg/internal/controller/controller.go:223 +0x565

Expected results:


Additional info:
Deployment and the failover operation were performed using automation
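
The stack trace above points at getRecoverOrProtectRequest.func4 being invoked with a nil request after the Velero backup failed validation. The following is a minimal, hypothetical Go sketch of that pattern (the names recoverRequest, Deallocate, and startRecover are illustrative, not ramen's actual code): a constructor returns a nil request together with an error, and an unconditional cleanup path then dereferences the nil request, producing the same "invalid memory address or nil pointer dereference" panic.

// nilcleanup.go - hypothetical sketch of the panic pattern, not ramen code.
package main

import (
	"errors"
	"fmt"
)

// recoverRequest stands in for the kube-objects recover request type.
type recoverRequest struct {
	name string
}

// Deallocate stands in for the cleanup invoked on the request; it dereferences
// the receiver, so calling it on a nil pointer panics.
func (r *recoverRequest) Deallocate() {
	fmt.Println("deallocating", r.name)
}

// startRecover returns a nil request when the Velero backup failed validation.
func startRecover(valid bool) (*recoverRequest, error) {
	if !valid {
		return nil, errors.New("backupFailedValidation")
	}
	return &recoverRequest{name: "group-0"}, nil
}

func main() {
	req, err := startRecover(false)
	if err != nil {
		// Bug: cleanup runs even though req is nil ->
		// "invalid memory address or nil pointer dereference".
		req.Deallocate()
	}
}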

Comment 5 Nir Soffer 2024-05-22 11:35:11 UTC
Pratik, you did not describe the problem and the expected results.

The title gives some hints:
"ramen proceeds to recover kubeObjects from a capture even if the capture is invalid"

But there is no detail on what the invalid capture is and what it means.

In the "actual results" you show that ramen was terminated after dereferencing a nil pointer,
this should never happen and easy to fix for this code path.

For "Is this reproducible" you answer Yes. This is not detailed enough. We need to understand if this
is a random error or it happens every time. Reporting how many runs you did and how many runs failed
will help.

Please complete:

- Description of the issue
- Expected results
- How many times did you test / how many times it failed
- Complete configuration for reproducing this issue in another system

Comment 6 Nir Soffer 2024-05-22 18:03:22 UTC
We think we understand the problem and it is fixed now. No more info needed.
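
For illustration only, a guard along the following lines matches the intent of the upstream "Don't cleanup nil request" change (reusing the hypothetical types from the sketch in the description; this is not the actual patch): a request that failed validation and was never created must not be deallocated.

// recoverOrCleanup sketches the guarded cleanup: only deallocate a request
// that actually exists, and surface the validation error to the caller.
func recoverOrCleanup(valid bool) error {
	req, err := startRecover(valid)
	if err != nil {
		if req != nil { // guard: skip cleanup when no request was created
			req.Deallocate()
		}
		return fmt.Errorf("kube objects recover failed: %w", err)
	}
	defer req.Deallocate()
	// ... proceed with recovery using req ...
	return nil
}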

Comment 14 errata-xmlrpc 2024-07-17 13:23:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.16.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:4591

