Bug 2189893

Summary: [RDR] When AppSet workloads are failed over to the secondary cluster while the primary cluster is powered off, only PVCs are created on the failover cluster
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Pratik Surve <prsurve>
Component: odf-dr
Assignee: Karolin Seeger <kseeger>
odf-dr sub component: ramen
QA Contact: Sidhant Agrawal <sagrawal>
Status: VERIFIED ---
Docs Contact:
Severity: urgent
Priority: unspecified
CC: muagarwa, odf-bz-bot, srangana
Version: 4.13
Keywords: AutomationBackLog
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
Embargoed:

Description Pratik Surve 2023-04-26 10:29:30 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

[RDR] When AppSet workloads are failed over to the secondary cluster while the primary cluster is powered off, only PVCs are created on the failover cluster

Version of all relevant components (if applicable):

OCP version:- 4.13.0-0.nightly-2023-04-18-005127
ODF version:- 4.13.0-168
CEPH version:- ceph version 17.2.6-10.el9cp (19b8858bfb3d0d1b84ec6f0d3fd7c6148831f7c8) quincy (stable)
ACM version:- 2.8.0-125
SUBMARINER version:- v0.15.0
VOLSYNC version:- volsync-product.v0.7.1

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Deploy AppSet workloads
3. Power off c1 and then fail over to c2 (see the example command after this list)
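
For reference, the failover in step 3 can also be triggered from the hub CLI by setting the failover action on the workload's DRPlacementControl. A minimal sketch, assuming the DRPC for this AppSet lives in the openshift-gitops namespace (the DRPC name is a placeholder, not taken from this report):

# Hypothetical example: trigger failover of the AppSet workload to c2
$ oc patch drpc <appset-drpc-name> -n openshift-gitops --type merge \
    -p '{"spec":{"action":"Failover","failoverCluster":"c2"}}'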


Actual results:
$ oc get pvc,pods                                                          
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
persistentvolumeclaim/dd-io-pvc-1   Bound    pvc-7ecd18e9-1219-4d9d-8cfe-2f18539639e8   117Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-2   Bound    pvc-d924d274-f52d-44cd-981d-5cdc6756dd7c   143Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-3   Bound    pvc-1504ea65-8db7-4221-9735-6667dfdcbca3   134Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-4   Bound    pvc-f3358c21-9d62-43f8-940d-d46593a771d6   106Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-5   Bound    pvc-61af163e-342b-47c7-a112-c183f30d14e6   115Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-6   Bound    pvc-b7419f97-fbb5-4e2a-87e8-3dd98907b415   129Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
persistentvolumeclaim/dd-io-pvc-7   Bound    pvc-394c5df7-dc9f-4c6c-a760-a9959497a1a7   149Gi      RWO            ocs-storagecluster-ceph-rbd   5h17m
 

No pods were created 

Expected results:
Pods should also be created

Additional info:
The following toleration was added to the Placement (see the sketch below for where it fits):

tolerations:
  - key: cluster.open-cluster-management.io/unreachable
    operator: Exists
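
A minimal sketch of the Placement with that toleration in place; the name and namespace are placeholders/assumptions, not values taken from this cluster:

apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: <appset-placement>      # placeholder
  namespace: openshift-gitops   # assumed namespace for the AppSet Placement
spec:
  tolerations:
    # Lets the Placement keep selecting clusters even when a managed cluster
    # is marked unreachable (e.g. powered off), as in this scenario
    - key: cluster.open-cluster-management.io/unreachable
      operator: Exists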

Comment 4 Shyamsundar 2023-04-26 12:26:02 UTC
Observations:

- DRPC now waits for the VRG to report all PVCs as Primary before updating the Placement and rolling out the workload to the cluster (failover/relocate)
- In this situation the PVCs are not yet Primary, so the pods are not seen on the failover cluster
- The PVCs are not Primary because the force-promote command is constantly timing out

As failover requires that the RBD mirror daemon be shut down on the target cluster, and in this case it was not (potentially because the new maintenance mode feature was expected to ensure this, but it is not yet functioning in this build), the force promote was constantly stuck.
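
A minimal sketch of how this can be checked on the failover cluster; the commands are generic, and the rook-ceph-rbd-mirror label is the usual Rook one, i.e. an assumption rather than something taken from this report:

# Per-PVC replication state as seen by Ramen (namespace from the data below)
$ oc get volumereplicationgroup -n app-busybox-8-1 -o yaml
$ oc get volumereplication -n app-busybox-8-1

# Whether the RBD mirror daemon is still running on the failover cluster
$ oc get pods -n openshift-storage -l app=rook-ceph-rbd-mirror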

I am moving this to ON_QA for observation on builds after 176, which should have the automatic disabling of RBD fixed (the commits are merged downstream, but are not yet included as of build 176).

Data:

oc get vr -n app-busybox-8-1 dd-io-pvc-1 -o yaml

message: 'failed to promote image "ocs-storagecluster-cephblockpool/csi-vol-68cb8d51-e316-416e-8351-3754db8b7381"
    with error: an error (timeout: context deadline exceeded) and stderror () occurred
    while running rbd args: [mirror image promote ocs-storagecluster-cephblockpool/csi-vol-68cb8d51-e316-416e-8351-3754db8b7381
    --force --id csi-rbd-provisioner -m 242.1.255.248:3300,242.1.255.250:3300,242.1.255.249:3300
    --keyfile=***stripped***]'

bash-5.1$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-d5accdea-5d8e-44ea-a739-3758b58fe0e5
csi-vol-d5accdea-5d8e-44ea-a739-3758b58fe0e5:
  global_id:   6d8484c5-0931-4ed4-9d1e-bd083a3a1811
  state:       down+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":31643648.0,"last_snapshot_bytes":42160128,"last_snapshot_sync_seconds":16,"remote_snapshot_timestamp":1682484301,"replay_state":"syncing","syncing_percent":0,"syncing_snapshot_timestamp":1682484301}
  last_update: 2023-04-26 12:01:48
  peer_sites:
    name: aaa2b599-056a-4087-9ece-cff0fa2963b1
    state: down+stopped
    description: local image is primary
    last_update: 2023-04-26 04:45:05
  snapshots:
    2647 .mirror.primary.6d8484c5-0931-4ed4-9d1e-bd083a3a1811.e69321e5-35da-41a1-8152-b2f8a5addaa1 (peer_uuids:[4bd73e12-49aa-48a6-ac56-bf061cc31b5e])
    2652 .mirror.primary.6d8484c5-0931-4ed4-9d1e-bd083a3a1811.70bccce0-49ff-47a8-977e-f6c8e1cff7c6 (peer_uuids:[4bd73e12-49aa-48a6-ac56-bf061cc31b5e])
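
For completeness, overall mirror daemon health for the pool can also be checked from the toolbox; a minimal sketch using the same pool name as above:

# Shows per-image states plus the health of the rbd-mirror daemons for the pool
$ rbd mirror pool status ocs-storagecluster-cephblockpool --verbose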