Bug 2264002 - Log SiteStatuses details and description in GetVolumeReplicationInfo RPC call
Summary: Log SiteStatuses details and description in GetVolumeReplicationInfo RPC call
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: yati padia
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-02-13 09:36 UTC by yati padia
Modified: 2024-03-19 15:32 UTC (History)

Fixed In Version: 4.15.0-147
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-03-19 15:32:44 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-csi pull 4431 0 None open rbd: log sitestatuses and description 2024-02-16 17:05:48 UTC
Github ceph ceph-csi pull 4447 0 None Merged rbd: log sitestatuses and description (backport #4431) 2024-02-20 09:04:11 UTC
Github red-hat-storage ceph-csi pull 260 0 None open BUG 2264002: rbd: log sitestatuses and description 2024-02-20 14:03:37 UTC
Red Hat Product Errata RHSA-2024:1383 0 None None None 2024-03-19 15:32:47 UTC

Description yati padia 2024-02-13 09:36:37 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

Log SiteStatuses details and description in the GetVolumeReplicationInfo RPC call for better debugging.
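
For context, the fix (ceph-csi PR #4431, "rbd: log sitestatuses and description") prints each mirror site status while handling the GetVolumeReplicationInfo RPC. A minimal sketch of that kind of logging, assuming the go-ceph rbd bindings; logSiteStatuses is a hypothetical helper and the exact code and location in ceph-csi differ:

package replication

import (
	"log"

	librbd "github.com/ceph/go-ceph/rbd"
)

// logSiteStatuses logs every site status reported for a mirrored RBD image:
// the MirrorUUID, state, free-form description (which carries the JSON
// replay statistics), last update timestamp, and up flag.
func logSiteStatuses(status *librbd.GlobalMirrorImageStatus) {
	for _, s := range status.SiteStatuses {
		log.Printf("Site status of MirrorUUID: %s, state: %v, description: %s, lastUpdate: %d, up: %t",
			s.MirrorUUID, s.State, s.Description, s.LastUpdate, s.Up)
	}
}

In a build with the fix, these lines appear in the csi-rbdplugin container of the provisioner pods and can be found by grepping for "Site status of" (see comment 12 below).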

Version of all relevant components (if applicable):

4.15

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Debugging the root cause of a failure is difficult due to the absence of proper debug logs.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

2

Is this issue reproducible?

Yes.

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 7 Oded 2024-02-22 12:37:19 UTC
I don't understand the test procedure. How do I verify this BZ?

Comment 9 Oded 2024-03-07 18:47:22 UTC
I don't see the expected log messages in the csi-rbdplugin container of the RBD provisioner pods.

ODF Version:4.15.0-155
OCP Version: 4.15.0
RDR cluster
platform: vsphere

1. Create PVC [SC=ceph-rbd]:
oviner~/ClusterPath/auth$ oc get pvc -n openshift-storage pvc-test 
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
pvc-test   Bound    pvc-35fea3a4-849d-442c-803a-b4391749b6ec   2Gi        RWO            ocs-storagecluster-ceph-rbd   59s

oviner~/ClusterPath/auth$ oc get pvc -n openshift-storage pvc-test -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    reclaimspace.csiaddons.openshift.io/cronjob: pvc-test-1709835591
    reclaimspace.csiaddons.openshift.io/schedule: '@weekly'
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2024-03-07T18:19:51Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: pvc-test
  namespace: openshift-storage
  resourceVersion: "1916048"
  uid: 35fea3a4-849d-442c-803a-b4391749b6ec
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Filesystem
  volumeName: pvc-35fea3a4-849d-442c-803a-b4391749b6ec
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 2Gi
  phase: Bound


2. Create VolumeReplicationClass:
$ oc create -f VolumeReplicationClass.yaml 
volumereplicationclass.replication.storage.openshift.io/volumereplicationclass-sample created

oviner~/ClusterPath/auth$ cat VolumeReplicationClass.yaml
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  name: volumereplicationclass-sample
spec:
  provisioner: example.provisioner.io
  parameters:
    replication.storage.openshift.io/replication-secret-name: secret-name
    replication.storage.openshift.io/replication-secret-namespace: secret-namespace
    # schedulingInterval is a vendor specific parameter. It is used to set the
    # replication scheduling interval for storage volumes that are replication
    # enabled using related VolumeReplication resource
    schedulingInterval: 1m

3. Create VolumeReplication:
$ oc create -f VolumeReplication.yaml 
volumereplication.replication.storage.openshift.io/my-vrt2 created

$ cat VolumeReplication.yaml 
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  name: my-vrt2
  namespace: openshift-storage
spec:
  volumeReplicationClass: volumereplicationclass-sample
  replicationState: primary
  replicationHandle: replicationHandle # optional
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-test

4. Check the logs of the csi-rbdplugin container of the RBD provisioner pods:
$ oc get pods | grep csi-rbdplugin-provisioner
csi-rbdplugin-provisioner-698f984d84-hblzp                        7/7     Running     1 (5h21m ago)   29h
csi-rbdplugin-provisioner-698f984d84-zc66w                        7/7     Running     1 (5h8m ago)    29h

oviner~/ClusterPath/auth$ oc logs csi-rbdplugin-provisioner-698f984d84-zc66w -n openshift-storage --all-containers | grep bytes_per_second
oviner~/ClusterPath/auth$ oc logs csi-rbdplugin-provisioner-698f984d84-hblzp -n openshift-storage --all-containers | grep bytes_per_second
oviner~/ClusterPath/auth$

Yati,
The odr-cluster-operator.v4.15.0-rhodf CSV is not in the Succeeded phase. Do you think I don't see the log because of the odr-cluster-operator.v4.15.0-rhodf issue?

$ oc get csv odr-cluster-operator.v4.15.0-rhodf
NAME                                 DISPLAY                         VERSION        REPLACES   PHASE
odr-cluster-operator.v4.15.0-rhodf   Openshift DR Cluster Operator   4.15.0-rhodf              Succeeded

    Message:               installing: waiting for deployment ramen-dr-cluster-operator to become ready: deployment "ramen-dr-cluster-operator" not available: Deployment does not have minimum availability.

Comment 11 Mudit Agarwal 2024-03-11 14:49:37 UTC
Hi Oded,

Why is this in FailedQA? Is it because:
1. you couldn't get the logs of the csi-rbdplugin container of the RBD provisioner pods, OR
2. you are not able to see the DR-related logs there?

If it is the first one, then it has nothing to do with this bug and you need to check why the logs are not accessible.
If it is the second one and ramen-dr-cluster-operator is not available, then it is expected.

Comment 12 Oded 2024-03-11 16:20:14 UTC
The manual test failed because the configuration of the VolumeReplicationClass was incorrect.
After changing the VolumeReplicationClass configuration, the test passed:

1. Create PVC [SC=ceph-rbd]:
$ oc get pvc pvc-test 
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
pvc-test   Bound    pvc-fa08dc48-e1d1-469c-b474-a0566ceca8c3   3Gi        RWO            ocs-storagecluster-ceph-rbd   15m

2. Create VolumeReplicationClass:
$ oc get vrc rbd-volumereplicationclass-1625360775  -o yaml 

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplicationClass
metadata:
  creationTimestamp: "2024-03-07T08:10:32Z"
  generation: 1
  labels:
    ramendr.openshift.io/maintenancemodes: Failover
    ramendr.openshift.io/replicationid: f9a754d3d1a7420fadc825ee5a5c67fbc809c91
  name: rbd-volumereplicationclass-1625360775
  ownerReferences:
  - apiVersion: work.open-cluster-management.io/v1
    kind: AppliedManifestWork
    name: a886abc37b147c9bcb446cc55d8427d165d0a651db5133c8bebc9104e5ec8b1b-vrc-1272416414
    uid: 4ef58a4b-59ea-4e70-9961-5438e02bce60
  resourceVersion: "3686329"
  uid: 658b57c0-7536-45a4-a746-21ad33857041
spec:
  parameters:
    mirroringMode: snapshot
    replication.storage.openshift.io/replication-secret-name: rook-csi-rbd-provisioner
    replication.storage.openshift.io/replication-secret-namespace: openshift-storage
    schedulingInterval: 5m
  provisioner: openshift-storage.rbd.csi.ceph.com

3. Create VolumeReplication:
$ oc get vr my-vrt2 -o yaml
apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  creationTimestamp: "2024-03-11T16:05:26Z"
  finalizers:
  - replication.storage.openshift.io
  generation: 1
  name: my-vrt2
  namespace: openshift-storage
  resourceVersion: "10404798"
  uid: ae8b1712-4fb7-4f8b-9dd9-0c1c6567ad94
spec:
  autoResync: false
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-test
  replicationHandle: replicationHandle
  replicationState: primary
  volumeReplicationClass: rbd-volumereplicationclass-1625360775
status:
  conditions:
  - lastTransitionTime: "2024-03-11T16:05:36Z"
    message: ""
    observedGeneration: 1
    reason: Promoted
    status: "True"
    type: Completed
  - lastTransitionTime: "2024-03-11T16:05:36Z"
    message: ""
    observedGeneration: 1
    reason: Healthy
    status: "False"
    type: Degraded
  - lastTransitionTime: "2024-03-11T16:05:36Z"
    message: ""
    observedGeneration: 1
    reason: NotResyncing
    status: "False"
    type: Resyncing
  lastCompletionTime: "2024-03-11T16:13:06Z"
  lastSyncDuration: 0s
  lastSyncTime: "2024-03-11T16:10:10Z"
  message: volume is marked primary
  observedGeneration: 1
  state: Primary


4. Check the logs of the csi-rbdplugin container of the RBD provisioner pods:
$ oc logs csi-rbdplugin-provisioner-6fdf9f5d64-bg6tp -n openshift-storage --all-containers | grep "Site status of"
884 Site status of MirrorUUID: 829c0d02-8a7d-4812-86f7-897a0048d4c1, state: replaying, description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":0.0,"last_snapshot_bytes":0,"last_snapshot_sync_seconds":0,"local_snapshot_timestamp":1710173701,"remote_snapshot_timestamp":1710173701,"replay_state":"idle"}, lastUpdate: 1710173714, up: true

Comment 13 errata-xmlrpc 2024-03-19 15:32:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

