Bug 2121159
| Summary: | [RDR] Relocate Operation of workload did not start | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | amagrawa, madam, muagarwa, ocs-bugs, odf-bz-bot |
| Version: | 4.11 | Keywords: | Regression |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-25 11:57:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pratik Surve
2022-08-24 16:48:40 UTC
The DRPC for busybox-workloads-3 is reporting ClusterDataProtected as false:
busybox-workloads-3 busybox-drpc 13d prsurve-vm-d Relocate Initiating
- lastTransitionTime: "2022-08-24T14:57:07Z"
message: Cluster data of one or more PVs are unprotected
observedGeneration: 2
reason: UploadError
status: "False"
type: ClusterDataProtected
The corresponding VRG is reporting PV upload errors as well:
- lastTransitionTime: "2022-08-24T14:56:59Z"
message: error getting object store, failed to protect cluster data for
PVC busybox-pvc-41, persistent error while uploading to s3 profile s3profile-vmware-dccp-one-ocs-storagecluster,
will retry later
observedGeneration: 2
reason: UploadError
status: "False"
type: ClusterDataProtected
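Protection state is surfaced through standard Kubernetes-style conditions on the DRPC and VRG. As a rough, self-contained sketch (the condition data is copied from this report; the helper name is hypothetical, not Ramen code), a script gating on protection could check:

```python
# Sketch: detect the unprotected-cluster-data state from a DRPC/VRG status.
# The condition below is copied verbatim from this bug report.
conditions = [
    {
        "lastTransitionTime": "2022-08-24T14:57:07Z",
        "message": "Cluster data of one or more PVs are unprotected",
        "observedGeneration": 2,
        "reason": "UploadError",
        "status": "False",
        "type": "ClusterDataProtected",
    }
]

def cluster_data_protected(conditions):
    """Return True only if the ClusterDataProtected condition exists and is True."""
    for cond in conditions:
        if cond["type"] == "ClusterDataProtected":
            return cond["status"] == "True"
    return False

print(cluster_data_protected(conditions))  # → False: relocate cannot progress
```

Against a live cluster the same condition list would come from `oc get drpc/vrg ... -o json` under `.status.conditions`.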
As a result, the relocate is not progressing: the PlacementRule and VR status show the workload still running on vmware-dccp-one, and the relocate remains stuck in the Initiating state.
Here is an error log from VRG for busybox-workloads-2 (which also reports ClusterDataProtected as false):
2022-08-24T14:41:05.846334201Z 1.6613520658462367e+09 DEBUG events record/event.go:311 Warning {"object": {"kind":"VolumeReplicationGroup","namespace":"busybox-workloads-2","name":"busybox-drpc","uid":"ea77ab0b-6090-403e-b5a6-2690ed3948ca","apiVersion":"ramendr.openshift.io/v1alpha1","resourceVersion":"42862598"}, "reason": "PVUploadFailed", "message": "error uploading PV to s3Profile s3profile-vmware-dccp-one-ocs-storagecluster, failed to protect cluster data for PVC busybox-pvc-26, failed to upload data of odrbucket-ebd82b72a969:busybox-workloads-2/busybox-drpc/v1.PersistentVolume/pvc-e422ec54-655e-41f2-9a18-9b400c80715d, SerializationError: failed to unmarshal error message\n\tstatus code: 504, request id: , host id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|\n00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|\n00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|\n00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|\n00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|\n00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|\n\ncaused by: expected element type <Error> but have <html>"}
The core error seems to be the following:
2022-08-24T15:01:40.566331893Z caused by: UnmarshalError: failed to unmarshal error message
2022-08-24T15:01:40.566331893Z 00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|
2022-08-24T15:01:40.566331893Z 00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|
2022-08-24T15:01:40.566331893Z 00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|
2022-08-24T15:01:40.566331893Z 00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|
2022-08-24T15:01:40.566331893Z 00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|
2022-08-24T15:01:40.566331893Z 00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|
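The hex dump in the log is just the raw response body the SDK could not unmarshal. Decoding it (a small sketch, hex copied from the log above) recovers the gateway's HTML error page, confirming a 504 from a proxy rather than an S3 XML error document:

```python
# Decode the hex dump from the VRG log to recover the raw HTTP response body.
hex_dump = """\
00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e
00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65
00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65
00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70
00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62
00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a
"""

# Drop the leading offset column on each line, then parse the remaining hex bytes.
data = bytes.fromhex(
    " ".join(line.split(None, 1)[1] for line in hex_dump.strip().splitlines())
)
print(data.decode())
# <html><body><h1>504 Gateway Time-out</h1>
# The server didn't respond in time.
# </body></html>
```

Since S3 signals errors with an XML `<Error>` document, the SDK's "expected element type &lt;Error&gt; but have &lt;html&gt;" message follows directly from this body.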
In other words, the upload is timing out because the S3 endpoint behind the profile s3profile-vmware-dccp-one-ocs-storagecluster is not reachable.
NooBaa operator logs on the server show the following errors:
NOTE: Unsure if this is the cause of the problem.
time="2022-08-24T15:51:03Z" level=info msg="Start BucketClass Reconcile..." bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BucketClass \"noobaa-default-bucket-class\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="RPC: Ping (0xc001900240) &{RPC:0xc000119220 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:connected WS:0xc00193c700 PendingRequests:map[wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8194:0xc00045d830 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8293:0xc0007f6480 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8294:0xc001017dd0 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8295:0xc001947590] NextRequestID:8296 Lock:{state:0 sema:0} ReconnectDelay:0s cancelPings:0x5ed060}"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"noobaa-default-backing-store\" is not yet ready" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="UpdateStatus: Done" bucketclass=openshift-storage/noobaa-default-bucket-class
@prsurve Can you confirm that NooBaa is healthy? It appears the S3 store is having issues, which is causing the problem described above.
Closing the bug as NOTABUG. Looking at the system, I found that the backing store was down:
$ oc get backingstores.noobaa.io
NAME                           TYPE            PHASE        AGE
noobaa-default-backing-store   s3-compatible   Connecting   14d
To resolve it, I deleted the noobaa-core-0 pod. The issue was resolved and the relocate operation completed successfully:
$ oc get drpc
busybox-workloads-3   busybox-drpc   14d   prsurve-vm-d   Relocate   Relocated
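For reference, the backing-store health check the resolution relied on can be sketched as a filter over `oc get backingstores.noobaa.io -o json` output. The sample status below is reconstructed from the `oc get` line in this report (assuming the BackingStore `status.phase` field, where anything other than `Ready` is suspect):

```python
# Sketch: flag BackingStores that are not Ready, as in this bug's resolution.
# Sample item reconstructed from the report; on a live cluster, feed in the
# output of: oc get backingstores.noobaa.io -n openshift-storage -o json
import json

sample = json.loads("""
{"items": [{"metadata": {"name": "noobaa-default-backing-store"},
            "spec": {"type": "s3-compatible"},
            "status": {"phase": "Connecting"}}]}
""")

unhealthy = [
    item["metadata"]["name"]
    for item in sample["items"]
    if item.get("status", {}).get("phase") != "Ready"
]
print(unhealthy)  # → ['noobaa-default-backing-store']
```

A non-Ready backing store means the S3 endpoint NooBaa fronts is unavailable, which is exactly why the Ramen PV uploads timed out.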