Bug 2121159
| Summary: | [RDR] Relocate Operation of workload did not start | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | odf-dr | Assignee: | Shyamsundar <srangana> |
| odf-dr sub component: | ramen | QA Contact: | krishnaram Karthick <kramdoss> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | amagrawa, madam, muagarwa, ocs-bugs, odf-bz-bot |
| Version: | 4.11 | Keywords: | Regression |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2022-08-25 11:57:12 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Pratik Surve
2022-08-24 16:48:40 UTC
The DRPC for busybox-workloads-3 is reporting ClusterDataProtected as false:
busybox-workloads-3 busybox-drpc 13d prsurve-vm-d Relocate Initiating
- lastTransitionTime: "2022-08-24T14:57:07Z"
message: Cluster data of one or more PVs are unprotected
observedGeneration: 2
reason: UploadError
status: "False"
type: ClusterDataProtected
The corresponding VRG is reporting PV upload errors as well:
- lastTransitionTime: "2022-08-24T14:56:59Z"
message: error getting object store, failed to protect cluster data for
PVC busybox-pvc-41, persistent error while uploading to s3 profile s3profile-vmware-dccp-one-ocs-storagecluster,
will retry later
observedGeneration: 2
reason: UploadError
status: "False"
type: ClusterDataProtected
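Protection state is surfaced through standard Kubernetes-style conditions on the DRPC and VRG. As a rough, self-contained sketch (the condition data is copied from this report; the helper name is hypothetical, not Ramen code), a script gating on protection could check:

```python
# Sketch: detect the unprotected-cluster-data state from a DRPC/VRG status.
# The condition below is copied verbatim from this bug report.
conditions = [
    {
        "lastTransitionTime": "2022-08-24T14:57:07Z",
        "message": "Cluster data of one or more PVs are unprotected",
        "observedGeneration": 2,
        "reason": "UploadError",
        "status": "False",
        "type": "ClusterDataProtected",
    }
]

def cluster_data_protected(conditions):
    """Return True only if the ClusterDataProtected condition exists and is True."""
    for cond in conditions:
        if cond["type"] == "ClusterDataProtected":
            return cond["status"] == "True"
    return False

print(cluster_data_protected(conditions))  # → False: relocate cannot progress
```

Against a live cluster the same condition list would come from `oc get drpc/vrg ... -o json` under `.status.conditions`.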
As a result, the relocate is not progressing: the PlacementRule and VR status show the workload still running on vmware-dccp-one, and the relocate remains stuck in the Initiating state.
Here is an error log from VRG for busybox-workloads-2 (which also reports ClusterDataProtected as false):
2022-08-24T14:41:05.846334201Z 1.6613520658462367e+09 DEBUG events record/event.go:311 Warning {"object": {"kind":"VolumeReplicationGroup","namespace":"busybox-workloads-2","name":"busybox-drpc","uid":"ea77ab0b-6090-403e-b5a6-2690ed3948ca","apiVersion":"ramendr.openshift.io/v1alpha1","resourceVersion":"42862598"}, "reason": "PVUploadFailed", "message": "error uploading PV to s3Profile s3profile-vmware-dccp-one-ocs-storagecluster, failed to protect cluster data for PVC busybox-pvc-26, failed to upload data of odrbucket-ebd82b72a969:busybox-workloads-2/busybox-drpc/v1.PersistentVolume/pvc-e422ec54-655e-41f2-9a18-9b400c80715d, SerializationError: failed to unmarshal error message\n\tstatus code: 504, request id: , host id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|\n00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|\n00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|\n00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|\n00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|\n00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|\n\ncaused by: expected element type <Error> but have <html>"}
The core error seems to be the following:
2022-08-24T15:01:40.566331893Z caused by: UnmarshalError: failed to unmarshal error message
2022-08-24T15:01:40.566331893Z 00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e |<html><body><h1>|
2022-08-24T15:01:40.566331893Z 00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65 |504 Gateway Time|
2022-08-24T15:01:40.566331893Z 00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65 |-out</h1>.The se|
2022-08-24T15:01:40.566331893Z 00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70 |rver didn't resp|
2022-08-24T15:01:40.566331893Z 00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62 |ond in time..</b|
2022-08-24T15:01:40.566331893Z 00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a |ody></html>.|
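The hex dump in the log is just the raw response body the SDK could not unmarshal. Decoding it (a small sketch, hex copied from the log above) recovers the gateway's HTML error page, confirming a 504 from a proxy rather than an S3 XML error document:

```python
# Decode the hex dump from the VRG log to recover the raw HTTP response body.
hex_dump = """\
00000000 3c 68 74 6d 6c 3e 3c 62 6f 64 79 3e 3c 68 31 3e
00000010 35 30 34 20 47 61 74 65 77 61 79 20 54 69 6d 65
00000020 2d 6f 75 74 3c 2f 68 31 3e 0a 54 68 65 20 73 65
00000030 72 76 65 72 20 64 69 64 6e 27 74 20 72 65 73 70
00000040 6f 6e 64 20 69 6e 20 74 69 6d 65 2e 0a 3c 2f 62
00000050 6f 64 79 3e 3c 2f 68 74 6d 6c 3e 0a
"""

# Drop the leading offset column on each line, then parse the remaining hex bytes.
data = bytes.fromhex(
    " ".join(line.split(None, 1)[1] for line in hex_dump.strip().splitlines())
)
print(data.decode())
# <html><body><h1>504 Gateway Time-out</h1>
# The server didn't respond in time.
# </body></html>
```

Since S3 signals errors with an XML `<Error>` document, the SDK's "expected element type &lt;Error&gt; but have &lt;html&gt;" message follows directly from this body.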
In other words, the upload is timing out because the S3 endpoint behind the profile s3profile-vmware-dccp-one-ocs-storagecluster is not reachable.
NooBaa operator logs on the server show the following errors:
NOTE: Unsure if this is the cause of the problem.
time="2022-08-24T15:51:03Z" level=info msg="Start BucketClass Reconcile..." bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BucketClass \"noobaa-default-bucket-class\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="RPC: Ping (0xc001900240) &{RPC:0xc000119220 Address:wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/ State:connected WS:0xc00193c700 PendingRequests:map[wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8194:0xc00045d830 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8293:0xc0007f6480 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8294:0xc001017dd0 wss://noobaa-mgmt.openshift-storage.svc.cluster.local:443/rpc/-8295:0xc001947590] NextRequestID:8296 Lock:{state:0 sema:0} ReconnectDelay:0s cancelPings:0x5ed060}"
time="2022-08-24T15:51:03Z" level=info msg="✅ Exists: BackingStore \"noobaa-default-backing-store\"\n"
time="2022-08-24T15:51:03Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"noobaa-default-backing-store\" is not yet ready" bucketclass=openshift-storage/noobaa-default-bucket-class
time="2022-08-24T15:51:03Z" level=info msg="UpdateStatus: Done" bucketclass=openshift-storage/noobaa-default-bucket-class
@prsurve Can you confirm that NooBaa is healthy? It appears the S3 store is having issues, which is causing the problem described above.
Closing the bug as NOTABUG. Looking at the system, I found that the backing store was down:
$ oc get backingstores.noobaa.io
NAME                           TYPE            PHASE        AGE
noobaa-default-backing-store   s3-compatible   Connecting   14d
To resolve it, I deleted the noobaa-core-0 pod. The issue was resolved and the relocate operation completed successfully:
$ oc get drpc
busybox-workloads-3   busybox-drpc   14d   prsurve-vm-d   Relocate   Relocated
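For reference, the backing-store health check the resolution relied on can be sketched as a filter over `oc get backingstores.noobaa.io -o json` output. The sample status below is reconstructed from the `oc get` line in this report (assuming the BackingStore `status.phase` field, where anything other than `Ready` is suspect):

```python
# Sketch: flag BackingStores that are not Ready, as in this bug's resolution.
# Sample item reconstructed from the report; on a live cluster, feed in the
# output of: oc get backingstores.noobaa.io -n openshift-storage -o json
import json

sample = json.loads("""
{"items": [{"metadata": {"name": "noobaa-default-backing-store"},
            "spec": {"type": "s3-compatible"},
            "status": {"phase": "Connecting"}}]}
""")

unhealthy = [
    item["metadata"]["name"]
    for item in sample["items"]
    if item.get("status", {}).get("phase") != "Ready"
]
print(unhealthy)  # → ['noobaa-default-backing-store']
```

A non-Ready backing store means the S3 endpoint NooBaa fronts is unavailable, which is exactly why the Ramen PV uploads timed out.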