Bug 2189893
| Summary: | [RDR] When Appset workloads are failed over to the secondary cluster while the primary cluster is powered off, only PVCs are created on the failover cluster | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Pratik Surve <prsurve> |
| Component: | odf-dr | Assignee: | Karolin Seeger <kseeger> |
| odf-dr sub component: | ramen | QA Contact: | Sidhant Agrawal <sagrawal> |
| Status: | VERIFIED --- | Docs Contact: | |
| Severity: | urgent | ||
| Priority: | unspecified | CC: | muagarwa, odf-bz-bot, srangana |
| Version: | 4.13 | Keywords: | AutomationBackLog |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Observations:
- DRPC now waits for the VRG to report all PVCs as Primary before updating the Placement and rolling the workload out to the cluster (failover/relocate)
- In this situation the PVCs are not yet Primary, so the pods are not seen on the failover cluster
- The PVCs are not Primary because the force promote command is constantly timing out
Failover requires that the RBD mirror daemon be shut down on the target cluster. In this case it was not (possibly because the new maintenance mode feature was expected to ensure this, but that feature is not yet functional in this build), so the force promote was constantly stuck.
I am moving this to ON_QA for observation on builds 176++, which should have the auto disable of RBD mirroring fixed (the commits are merged downstream but were not yet included in build 176).
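For reference, a minimal sketch of how the DRPC and VRG state can be inspected while the failover is in this stage; the namespace and PVC name are taken from the Data section below, and resource short names may differ by build:

# On the hub: current DRPC phase and placement decision
oc get drpc -A -o wide

# On the failover cluster: does the VRG report the protected PVCs as Primary yet?
oc get volumereplicationgroup -n app-busybox-8-1 -o yaml

# Individual VolumeReplication resources surface the promote errors directly
oc get vr -n app-busybox-8-1
oc get vr -n app-busybox-8-1 dd-io-pvc-1 -o yaml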
Data:
oc get vr -n app-busybox-8-1 dd-io-pvc-1 -o yaml
message: 'failed to promote image "ocs-storagecluster-cephblockpool/csi-vol-68cb8d51-e316-416e-8351-3754db8b7381"
with error: an error (timeout: context deadline exceeded) and stderror () occurred
while running rbd args: [mirror image promote ocs-storagecluster-cephblockpool/csi-vol-68cb8d51-e316-416e-8351-3754db8b7381
--force --id csi-rbd-provisioner -m 242.1.255.248:3300,242.1.255.250:3300,242.1.255.249:3300
--keyfile=***stripped***]'
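The force promote shown in the error can be retried manually to confirm it is still timing out. A hedged sketch, assuming the rook-ceph toolbox deployment is available in openshift-storage; the pool and image names are taken from the error message above:

# Open a shell in the toolbox pod
oc -n openshift-storage rsh deploy/rook-ceph-tools

# Retry the same force promote; per the observation above, while the local
# rbd-mirror daemon is still running this is expected to hang and hit the
# context deadline
rbd mirror image promote --force \
  ocs-storagecluster-cephblockpool/csi-vol-68cb8d51-e316-416e-8351-3754db8b7381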
bash-5.1$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-d5accdea-5d8e-44ea-a739-3758b58fe0e5
csi-vol-d5accdea-5d8e-44ea-a739-3758b58fe0e5:
global_id: 6d8484c5-0931-4ed4-9d1e-bd083a3a1811
state: down+replaying
description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":31643648.0,"last_snapshot_bytes":42160128,"last_snapshot_sync_seconds":16,"remote_snapshot_timestamp":1682484301,"replay_state":"syncing","syncing_percent":0,"syncing_snapshot_timestamp":1682484301}
last_update: 2023-04-26 12:01:48
peer_sites:
name: aaa2b599-056a-4087-9ece-cff0fa2963b1
state: down+stopped
description: local image is primary
last_update: 2023-04-26 04:45:05
snapshots:
2647 .mirror.primary.6d8484c5-0931-4ed4-9d1e-bd083a3a1811.e69321e5-35da-41a1-8152-b2f8a5addaa1 (peer_uuids:[4bd73e12-49aa-48a6-ac56-bf061cc31b5e])
2652 .mirror.primary.6d8484c5-0931-4ed4-9d1e-bd083a3a1811.70bccce0-49ff-47a8-977e-f6c8e1cff7c6 (peer_uuids:[4bd73e12-49aa-48a6-ac56-bf061cc31b5e])
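To cross-check the observation above that the RBD mirror daemon was not shut down on the failover cluster, the daemon and overall pool mirroring health can be inspected. A hedged sketch; the label and deployment name below are the usual Rook defaults and may differ:

# From the toolbox: per-pool mirroring status, including daemon health entries
rbd mirror pool status ocs-storagecluster-cephblockpool --verbose

# From the failover cluster: is the rbd-mirror daemon still running?
oc -n openshift-storage get pods -l app=rook-ceph-rbd-mirror
oc -n openshift-storage get deployment rook-ceph-rbd-mirror-a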
Description of problem (please be detailed as possible and provide log snippets):

[RDR] When Appset workloads are failed over to the secondary cluster while the primary cluster is powered off, only PVCs are created on the failover cluster.

Version of all relevant components (if applicable):
OCP version: 4.13.0-0.nightly-2023-04-18-005127
ODF version: 4.13.0-168
CEPH version: ceph version 17.2.6-10.el9cp (19b8858bfb3d0d1b84ec6f0d3fd7c6148831f7c8) quincy (stable)
ACM version: 2.8.0-125
SUBMARINER version: v0.15.0
VOLSYNC version: volsync-product.v0.7.1

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue reproducible?

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy an RDR cluster
2. Deploy Appset workloads
3. Power off c1 and then fail over to c2

Actual results:
$ oc get pvc,pods
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/dd-io-pvc-1   Bound    pvc-7ecd18e9-1219-4d9d-8cfe-2f18539639e8   117Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-2   Bound    pvc-d924d274-f52d-44cd-981d-5cdc6756dd7c   143Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-3   Bound    pvc-1504ea65-8db7-4221-9735-6667dfdcbca3   134Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-4   Bound    pvc-f3358c21-9d62-43f8-940d-d46593a771d6   106Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-5   Bound    pvc-61af163e-342b-47c7-a112-c183f30d14e6   115Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-6   Bound    pvc-b7419f97-fbb5-4e2a-87e8-3dd98907b415   129Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-7   Bound    pvc-394c5df7-dc9f-4c6c-a760-a9959497a1a7   149Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m

No pods were created.

Expected results:
Pods should also be created.

Additional info:
The following toleration was added to the Placement:

tolerations:
- key: cluster.open-cluster-management.io/unreachable
  operator: Exists
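For context, a sketch of how the toleration noted in the additional info typically sits on an ACM Placement; only the toleration itself is taken from this bug, while the Placement name, namespace, and clusterSets are illustrative:

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: busybox-appset-placement   # illustrative name
  namespace: openshift-gitops      # illustrative, typical for ApplicationSet placements
spec:
  clusterSets:
    - default                      # illustrative ManagedClusterSet
  # Keeps the Placement scheduling even though the primary cluster is
  # unreachable (powered off)
  tolerations:
    - key: cluster.open-cluster-management.io/unreachable
      operator: Exists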