Bug 2239589

Summary: [RDR][Tracker] A few rbd images remain stuck in starting_replay post failover; image_health also reports a warning
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Aman Agrawal <amagrawa>
Component: odf-dr Assignee: Shyamsundar <srangana>
odf-dr sub component: ramen QA Contact: Aman Agrawal <amagrawa>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: unspecified CC: idryomov, kseeger, mrajanna, muagarwa, rtalur, srangana
Version: 4.14   
Target Milestone: ---   
Target Release: ODF 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 4.14.0-139 Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 18:54:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
Target Upstream Version:
Embargoed:

Description Aman Agrawal 2023-09-19 07:32:50 UTC
Description of problem (please be as detailed as possible and provide log
snippets): These 2 rbd-based workloads, which were in the Deployed state earlier, were failed over.
amagrawa:acm$ drpc
NAMESPACE             NAME                                   AGE     PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION           PEER READY
busybox-workloads-2   busybox-workloads-2-placement-1-drpc   2d23h   amagrawa-c1        amagrawa-c2       Failover       FailedOver     Completed     2023-09-18T19:18:58Z   48m18.757117217s   True

openshift-gitops      busybox-workloads-1-placement-drpc     2d23h   amagrawa-c1        amagrawa-c2       Failover       FailedOver     Completed     2023-09-18T19:17:45Z   44m1.2498824s      True
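The `drpc` command above is a shell alias; the underlying command is not shown in the report, but it presumably wraps `oc get drpc --all-namespaces -o wide` or similar. A minimal sketch (command path and column layout assumed) that flags any DRPC which has not finished failing over, using the relevant columns from the output above as sample data:

```shell
# On a live hub the table would come from something like (assumed):
#   oc get drpc --all-namespaces -o wide
# The relevant columns are reproduced here from the output above.
drpc_out='NAME CURRENTSTATE PROGRESSION PEERREADY
busybox-workloads-2-placement-1-drpc FailedOver Completed True
busybox-workloads-1-placement-drpc FailedOver Completed True'

# List any DRPC whose failover has not completed; empty output means all done.
pending=$(printf '%s\n' "$drpc_out" \
  | awk 'NR>1 && ($2 != "FailedOver" || $3 != "Completed") { print $1 }')
echo "pending=${pending:-none}"
```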




Version of all relevant components (if applicable):
ODF 4.14.0-132.stable
OCP 4.14.0-0.nightly-2023-09-02-132842
ACM 2.9.0-DOWNSTREAM-2023-08-24-09-30-12
subctl version: v0.16.0
ceph version 17.2.6-138.el9cp (b488c8dad42b2ecffcd96f3d76eeeecce48b8590) quincy (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy AppSet- and Subscription-based DR-protected rbd workloads on an RDR setup and run IOs for a few days (4-5 days in this case)
2. Perform a submariner connectivity check; ensure mirroring is working, lastGroupSyncTime is within the desired range, rbd image status is healthy, etc.
3. Bring the master nodes of the primary cluster down
4. Perform failover of both the AppSet- and Subscription-based workloads once the cluster is marked unavailable in the ACM UI
5. Wait for Pods and other resources to come up on the new primary, and for the drpc progression state to change to Cleaning Up
6. Bring the master nodes up and let the cluster become reachable
7. Ensure the subctl verify connectivity check passes for submariner, and wait for cleanup to complete
8. Check rbd image status again

Actual results: A few rbd images remain stuck in starting_replay post failover; image_health also reports a warning

C1-
amagrawa:~$ mirror
{
  "lastChecked": "2023-09-18T20:14:39Z",
  "summary": {
    "daemon_health": "OK",
    "health": "WARNING",
    "image_health": "WARNING",
    "states": {
      "replaying": 9,
      "starting_replay": 3
    }
  }
}

csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297
csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297:
  global_id:   1aec3522-725e-4e9c-8d1b-134a861d55b8
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":52676608.0,"last_snapshot_bytes":63197184,"last_snapshot_sync_seconds":1,"local_snapshot_timestamp":1695067800,"remote_snapshot_timestamp":1695067800,"replay_state":"idle"}
  service:     a on compute-0
  last_update: 2023-09-18 20:16:30
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18
##########################################
csi-vol-b3031480-5f84-4bf4-b976-6809c32fa1f8
csi-vol-b3031480-5f84-4bf4-b976-6809c32fa1f8:
  global_id:   6ba80b97-eb9c-4eff-8ffe-d691b14befee
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":284784640.0,"last_snapshot_bytes":389464064,"last_snapshot_sync_seconds":7,"local_snapshot_timestamp":1695068100,"remote_snapshot_timestamp":1695068100,"replay_state":"idle"}
  last_update: 2023-09-18 20:16:24
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18
##########################################
csi-vol-d4e319c8-2329-482c-9804-7bc023b59d22
csi-vol-d4e319c8-2329-482c-9804-7bc023b59d22:
  global_id:   aaddeb70-0e98-4389-9c3c-ab0e385e76de
  state:       up+replaying
  description: replaying, {"bytes_per_second":0.0,"bytes_per_snapshot":326821888.0,"last_snapshot_bytes":445595648,"last_snapshot_sync_seconds":8,"local_snapshot_timestamp":1695068100,"remote_snapshot_timestamp":1695068100,"replay_state":"idle"}
  last_update: 2023-09-18 20:16:24
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay
    description: starting replay
    last_update: 2023-09-18 20:16:18
##########################################
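Per-image dumps like the ones above come from `rbd mirror image status` (the pool name below is an assumption). Post failover, the peer-site state should settle back to `up+replaying`; here the images are stuck with a peer in `up+starting_replay`. A sketch that pulls the peer-site state out of such a dump, using an abridged copy of the first stuck image above:

```shell
# On a live cluster (pool/image names assumed):
#   rbd -p ocs-storagecluster-cephblockpool mirror image status csi-vol-...
# Sample dump, abridged from the first stuck image above.
status='csi-vol-4d7b71bc-afa5-478b-8a1b-cf372eb5f297:
  global_id:   1aec3522-725e-4e9c-8d1b-134a861d55b8
  state:       up+replaying
  peer_sites:
    name: 724e0358-2cfc-4a0f-9a99-419999493584
    state: up+starting_replay'

# Grab the first state line after peer_sites:; healthy replication shows up+replaying.
peer_state=$(printf '%s\n' "$status" \
  | awk '/peer_sites:/ { p = 1 } p && /state:/ { print $2; exit }')
echo "peer_state=$peer_state"
```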



C2-
amagrawa:c2$ mirror
{
  "lastChecked": "2023-09-18T20:34:46Z",
  "summary": {
    "daemon_health": "OK",
    "health": "WARNING",
    "image_health": "WARNING",
    "states": {
      "replaying": 9,
      "starting_replay": 3
    }
  }
}
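The `mirror` alias is likewise not expanded in the report; the JSON shape matches the CephBlockPool `.status.mirroringStatus.summary` field, so the alias presumably reads that (the resource and pool names below are assumptions). A sketch that extracts the image health and stuck-image count from the summary, using the C2 output verbatim and only POSIX tools:

```shell
# On a live cluster the summary would come from something like (assumed):
#   oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
#     -o jsonpath='{.status.mirroringStatus.summary}'
summary='{
  "lastChecked": "2023-09-18T20:34:46Z",
  "summary": {
    "daemon_health": "OK",
    "health": "WARNING",
    "image_health": "WARNING",
    "states": {
      "replaying": 9,
      "starting_replay": 3
    }
  }
}'

# A healthy pool reports image_health OK and no starting_replay entries.
stuck=$(printf '%s\n' "$summary" | sed -n 's/.*"starting_replay": \([0-9]*\).*/\1/p')
health=$(printf '%s\n' "$summary" | sed -n 's/.*"image_health": "\([A-Z]*\)".*/\1/p')
echo "image_health=$health stuck_in_starting_replay=${stuck:-0}"
```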


Must gather logs are kept here- http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-aman/19sept23-1/

Subctl verify logs collected post failover when older primary cluster became reachable- http://pastebin.test.redhat.com/1109565
All checks passed

Expected results: rbd images should return to the replaying state post failover and report as healthy.


Additional info:

Comment 3 Aman Agrawal 2023-09-19 07:55:42 UTC
Failover was performed from C1 to C2.

Comment 7 Aman Agrawal 2023-09-21 19:08:16 UTC
As confirmed by Ilya in https://chat.google.com/room/AAAAqWkMm2s/wxV5kxtqX1g
nearfull isn't the same as full, so it shouldn't be a problem.

Comment 18 Aman Agrawal 2023-10-19 06:45:57 UTC
The issue was not reproduced when testing with:

ODF 4.14.0-150.stable
ACM 2.9.0-DOWNSTREAM-2023-10-12-14-53-11
advanced-cluster-management.v2.9.0-187
Submariner brew.registry.redhat.io/rh-osbs/iib:594788
OCP 4.14.0-0.nightly-2023-10-14-061428
ceph version 17.2.6-146.el9cp (1d01c2b30b5fd39787bb8804707c4b2e52e30137) quincy (stable)

Therefore, marking it as Verified.

Comment 20 errata-xmlrpc 2023-11-08 18:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832