Bug 2322019 - [RDR] [Flatten] Proper error messages aren't shown when a drpolicy without flattening is applied to cloned/snapshot restored PVC [NEEDINFO]
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-addons
Version: 4.17
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Niels de Vos
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-10-27 19:26 UTC by Aman Agrawal
Modified: 2024-10-28 15:21 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Flags: nsoffer: needinfo? (rar)




Links:
Red Hat Issue Tracker OCSBZM-9439 (last updated 2024-10-27 19:28:04 UTC)

Description Aman Agrawal 2024-10-27 19:26:44 UTC
Description of problem (please be as detailed as possible and provide log
snippets):


Version of all relevant components (if applicable):
OCP 4.17.0-0.nightly-2024-10-20-231827
ODF 4.17.0-126
ACM 2.12.0-DOWNSTREAM-2024-10-18-21-57-41
OpenShift Virtualization 4.17.1-19
Submariner 0.19 unreleased downstream image 846949
ceph version 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)
OADP 1.4.1
OpenShift GitOps 1.14.0
VolSync 0.10.1


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RBD CNV workload on an RDR setup using discovered apps. Create a clone of the PVC.
2. Delete the workload.
3. Now deploy a workload in such a way that it consumes the cloned PVC.
4. DR protect this workload with a drpolicy where flattening is not enabled (see the configuration sketch after these steps).
5. The VR is promoted to primary, and sync and backup initially look fine for the workload, but the RBD image does not undergo flattening.
6. After a while, sync stops progressing for this workload, and the root cause is hard to debug because proper error messages are missing from the VR/DRPC resources.
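
For reference, a minimal sketch of the configuration involved. The assumption here (based on the ceph-csi/csi-addons flattenMode parameter; all names other than the VolumeReplicationClass and PVC names taken from the VR output below are illustrative) is that "flattening not enabled" means the VolumeReplicationClass carries no flattenMode parameter:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: root-disk                  # the cloned PVC from step 1 (name from the VR below)
  namespace: busybox-workloads-100
spec:
  storageClassName: ocs-storagecluster-ceph-rbd   # illustrative
  dataSource:
    kind: PersistentVolumeClaim    # CSI clone of the original PVC
    name: original-root-disk       # hypothetical source PVC name
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 30Gi                # illustrative size
---
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: rbd-volumereplicationclass-1625360775
spec:
  provisioner: openshift-storage.rbd.csi.ceph.com
  parameters:
    mirroringMode: snapshot
    schedulingInterval: 5m
    # flattenMode: force  <-- absent, so cloned/snapshot-restored images are never flattened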


Actual results:
VR-

oc describe vr -n busybox-workloads-100
Name:         root-disk
Namespace:    busybox-workloads-100
Labels:       ramendr.openshift.io/owner-name=busybox-100
              ramendr.openshift.io/owner-namespace-name=openshift-dr-ops
Annotations:  <none>
API Version:  replication.storage.openshift.io/v1alpha1
Kind:         VolumeReplication
Metadata:
  Creation Timestamp:  2024-10-27T18:04:31Z
  Finalizers:
    replication.storage.openshift.io
  Generation:        1
  Resource Version:  9855180
  UID:               c4ae8511-9fa1-4a53-8374-8b87288255d1
Spec:
  Auto Resync:  false
  Data Source:
    API Group:
    Kind:                    PersistentVolumeClaim
    Name:                    root-disk
  Replication Handle:
  Replication State:         primary
  Volume Replication Class:  rbd-volumereplicationclass-1625360775
Status:
  Conditions:
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                Promoted
    Status:                True
    Type:                  Completed
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                Healthy
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2024-10-27T18:04:35Z
    Message:
    Observed Generation:   1
    Reason:                NotResyncing
    Status:                False
    Type:                  Resyncing
  Last Completion Time:    2024-10-27T18:47:06Z
  Last Sync Duration:      0s
  Last Sync Time:          2024-10-27T18:45:00Z
  Message:                 volume is marked primary
  Observed Generation:     1
  State:                   Primary
Events:                    <none>


DRPC-

oc get drpc busybox-100 -oyaml -n openshift-dr-ops
apiVersion: ramendr.openshift.io/v1alpha1
kind: DRPlacementControl
metadata:
  annotations:
    drplacementcontrol.ramendr.openshift.io/app-namespace: openshift-dr-ops
    drplacementcontrol.ramendr.openshift.io/last-app-deployment-cluster: amagrawa-21o-1
  creationTimestamp: "2024-10-27T18:04:31Z"
  finalizers:
  - drpc.ramendr.openshift.io/finalizer
  generation: 2
  labels:
    cluster.open-cluster-management.io/backup: ramen
  name: busybox-100
  namespace: openshift-dr-ops
  ownerReferences:
  - apiVersion: cluster.open-cluster-management.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Placement
    name: busybox-100-placement-1
    uid: e36cc23e-b6ad-4e24-ab76-0b8f2332aa9e
  resourceVersion: "8573969"
  uid: 552aaddd-3376-4550-ba3d-b7150e27ac91
spec:
  drPolicyRef:
    apiVersion: ramendr.openshift.io/v1alpha1
    kind: DRPolicy
    name: odr-policy-5m
  kubeObjectProtection:
    captureInterval: 5m0s
    kubeObjectSelector:
      matchExpressions:
      - key: appname
        operator: In
        values:
        - vm
  placementRef:
    apiVersion: cluster.open-cluster-management.io/v1beta1
    kind: Placement
    name: busybox-100-placement-1
    namespace: openshift-dr-ops
  preferredCluster: amagrawa-21o-1
  protectedNamespaces:
  - busybox-workloads-100
  pvcSelector:
    matchExpressions:
    - key: appname
      operator: In
      values:
      - vm
status:
  actionDuration: 15.045573062s
  actionStartTime: "2024-10-27T18:04:46Z"
  conditions:
  - lastTransitionTime: "2024-10-27T18:04:31Z"
    message: Initial deployment completed
    observedGeneration: 2
    reason: Deployed
    status: "True"
    type: Available
  - lastTransitionTime: "2024-10-27T18:04:31Z"
    message: Ready
    observedGeneration: 2
    reason: Success
    status: "True"
    type: PeerReady
  - lastTransitionTime: "2024-10-27T18:07:31Z"
    message: VolumeReplicationGroup (openshift-dr-ops/busybox-100) on cluster amagrawa-21o-1
      is protecting required resources and data
    observedGeneration: 2
    reason: Protected
    status: "True"
    type: Protected
  lastGroupSyncDuration: 0s
  lastGroupSyncTime: "2024-10-27T18:10:00Z"
  lastKubeObjectProtectionTime: "2024-10-27T18:54:38Z"
  lastUpdateTime: "2024-10-27T18:59:33Z"
  observedGeneration: 2
  phase: Deployed
  preferredDecision:
    clusterName: amagrawa-21o-1
    clusterNamespace: amagrawa-21o-1
  progression: Completed
  resourceConditions:
    conditions:
    - lastTransitionTime: "2024-10-27T18:04:35Z"
      message: PVCs in the VolumeReplicationGroup are ready for use
      observedGeneration: 1
      reason: Ready
      status: "True"
      type: DataReady
    - lastTransitionTime: "2024-10-27T18:04:32Z"
      message: VolumeReplicationGroup is replicating
      observedGeneration: 1
      reason: Replicating
      status: "False"
      type: DataProtected
    - lastTransitionTime: "2024-10-27T18:04:31Z"
      message: Nothing to restore
      observedGeneration: 1
      reason: Restored
      status: "True"
      type: ClusterDataReady
    - lastTransitionTime: "2024-10-27T18:04:39Z"
      message: Cluster data of all PVs are protected
      observedGeneration: 1
      reason: Uploaded
      status: "True"
      type: ClusterDataProtected
    resourceMeta:
      generation: 1
      kind: VolumeReplicationGroup
      name: busybox-100
      namespace: openshift-dr-ops
      protectedpvcs:
      - root-disk
      resourceVersion: "9869528"


The VolumeSynchronizationDelay alert does fire on the hub cluster if cluster monitoring labelling has been configured, but note that this labelling is optional, and the alert doesn't point to the root cause; there could be any number of reasons why sync isn't progressing.

Also, to check whether the image underwent flattening, one has to rsh into the toolbox pod and run the ceph progress command, which isn't recommended for customers (see the example below).
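
For reference, the manual check mentioned above looks roughly like this (the toolbox pod label is the usual rook-ceph-tools one; pool and image names are illustrative):

oc rsh -n openshift-storage $(oc get pod -n openshift-storage -l app=rook-ceph-tools -o name)
# inside the toolbox:
ceph progress    # a running flatten shows up here as a progress event
rbd info ocs-storagecluster-cephblockpool/csi-vol-<uuid>
# if the output still contains a "parent:" line, the image was never flattened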



Expected results: Proper error messages should be shown in the VR and DRPC resources when a drpolicy without flattening is applied to a cloned/snapshot-restored PVC and sync doesn't resume / the RBD image doesn't undergo flattening.
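
Purely as an illustration of what such an error message could look like (the reason and message strings below are invented for this report, not existing csi-addons output), a Degraded condition along these lines on the VR would make the failure self-explanatory:

status:
  conditions:
  - type: Degraded
    status: "True"
    reason: FlattenRequired          # hypothetical reason string
    message: rbd image has a parent and the VolumeReplicationClass does not
      enable flattening (flattenMode); replication cannot progress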


Additional info:

