Description of problem (please be as detailed as possible and provide log snippets):

[RDR] Relocate operation of the workload did not start.

Version of all relevant components (if applicable):
OCP version: 4.11.0-0.nightly-2022-08-04-081314
ODF version: 4.11.0-133
Ceph version: ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)
ACM version: 2.5.2

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR application.
2. After some days, perform a relocate operation on the workload.

Actual results:
The relocate operation did not start.

Expected results:
The relocate operation should complete successfully.

Additional info:
"busybox-workloads-3" is the namespace for which the relocate operation was triggered.
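For reference, the relocate in step 2 is typically requested by setting the action on the application's DRPC; a minimal sketch (the DRPC name and namespace are taken from this report, but the exact command or UI action used by the reporter was not captured):

$ oc patch drpc busybox-drpc -n busybox-workloads-3 --type merge -p '{"spec":{"action":"Relocate"}}'   # request relocate to the preferred cluster
$ oc get drpc busybox-drpc -n busybox-workloads-3 -w                                                   # watch progress; it should move from Initiating to Relocated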
The DRPC for busybox-workloads-3 is reporting ClusterDataProtected as "False":

busybox-workloads-3   busybox-drpc   13d   prsurve-vm-d   Relocate   Initiating

- lastTransitionTime: "2022-08-24T14:57:07Z"
  message: Cluster data of one or more PVs are unprotected
  observedGeneration: 2
  reason: UploadError
  status: "False"
  type: ClusterDataProtected

The corresponding VRG is reporting PV upload errors as well:

- lastTransitionTime: "2022-08-24T14:56:59Z"
  message: error getting object store, failed to protect cluster data for PVC busybox-pvc-41, persistent error while uploading to s3 profile s3profile-vmware-dccp-one-ocs-storagecluster, will retry later
  observedGeneration: 2
  reason: UploadError
  status: "False"
  type: ClusterDataProtected

As a result, the relocate is not progressing. From the PlacementRule and VR status, the workload is still running on vmware-dccp-one, and the relocate is stuck as stated above.

Here is an error log from the VRG for busybox-workloads-2 (which also reports ClusterDataProtected as "False"):

2022-08-24T14:41:05.846334201Z 1.6613520658462367e+09 DEBUG events record/event.go:311 Warning {"object": {"kind":"VolumeReplicationGroup","namespace":"busybox-workloads-2","name":"busybox-drpc","uid":"ea77ab0b-6090-403e-b5a6-2690ed3948ca","apiVersion":"ramendr.openshift.io/v1alpha1","resourceVersion":"42862598"}, "reason": "PVUploadFailed", "message": "error uploading PV to s3Profile s3profile-vmware-dccp-one-ocs-storagecluster, failed to protect cluster data for PVC busybox-pvc-26, failed to upload data of odrbucket-ebd82b72a969:busybox-workloads-2/busybox-drpc/v1.PersistentVolume/pvc-e422ec54-655e-41f2-9a18-9b400c80715d, SerializationError: failed to unmarshal error message\n\tstatus code: 504, request id: , host id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|\n00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|\n00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|\n00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|\n00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|\n00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|\n\ncaused by: expected element type <Error> but have <html>"}

The core error seems to be the following:

2022-08-24T15:01:40.566331893Z caused by: UnmarshalError: failed to unmarshal error message
2022-08-24T15:01:40.566331893Z 00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|
2022-08-24T15:01:40.566331893Z 00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|
2022-08-24T15:01:40.566331893Z 00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|
2022-08-24T15:01:40.566331893Z 00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|
2022-08-24T15:01:40.566331893Z 00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|
2022-08-24T15:01:40.566331893Z 00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|

In other words, the upload is timing out because the S3 endpoint is not reachable: Ramen is unable to reach the S3 profile s3profile-vmware-dccp-one-ocs-storagecluster.
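For anyone retracing this, the stuck conditions can be inspected directly; a minimal sketch (resource names and namespaces are taken from this report, and the exact commands used during triage were not recorded):

$ oc get drpc busybox-drpc -n busybox-workloads-3 -o yaml                     # on the hub: status.conditions should show ClusterDataProtected=True
$ oc get volumereplicationgroup busybox-drpc -n busybox-workloads-3 -o yaml   # on the managed cluster still running the workload: check VRG status.conditions for UploadError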
NooBaa logs on the affected cluster have the following errors (NOTE: unsure if this is the cause of the problem):

time="2022-08-24T15:51:03Z" level=info msg="Start BucketClass Reconcile..." bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BucketClass \"noobaa-default-bucket-class\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="RPC: Ping (0xc001900240) &{RPC:0xc000119220 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:connected WS:0xc00193c700 PendingRequests:map[wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8194:0xc00045d830 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8293:0xc0007f6480 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8294:0xc001017dd0 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8295:0xc001947590] NextRequestID:8296 Lock:{state:0 sema:0} ReconnectDelay:0s cancelPings:0x5ed060}"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"noobaa-default-backing-store\" is not yet ready" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="UpdateStatus: Done" bucketclass=openshift-storage/noobaa-default-bucket-class

@prsurve Can you ensure NooBaa health is as desired? It seems that the S3 store is having issues, which is causing the problem described above.
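For reference, NooBaa health on the managed cluster can be checked roughly as follows (a sketch; openshift-storage is assumed as the ODF install namespace):

$ oc get noobaa,backingstore,bucketclass -n openshift-storage   # all should report phase Ready
$ oc get pods -n openshift-storage | grep noobaa                # core, db and endpoint pods should be Running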
Closing the bug as not a bug.

Looking at the system, I found that the backing store was down:

$ oc get backingstores.noobaa.io
NAME                           TYPE            PHASE        AGE
noobaa-default-backing-store   s3-compatible   Connecting   14d

To resolve it, I deleted the noobaa-core-0 pod. The issue was resolved and the relocate operation completed successfully:

$ oc get drpc
busybox-workloads-3   busybox-drpc   14d   prsurve-vm-d   Relocate   Relocated
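For anyone hitting the same symptom, the recovery and verification amount to the following sketch (based on the commands above; the openshift-storage namespace is assumed, and pod/resource names are from this report):

$ oc delete pod noobaa-core-0 -n openshift-storage       # restart the NooBaa core pod
$ oc get backingstores.noobaa.io -n openshift-storage    # phase should return to Ready
$ oc get drpc busybox-drpc -n busybox-workloads-3        # on the hub: should eventually show Relocated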