
Bug 2406794

Summary: [RDR] A few RBD images report error due to incomplete group snapshots on the secondary cluster after workload deployment
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Aman Agrawal <amagrawa>
Component: RBD-Mirror
Assignee: Ilya Dryomov <idryomov>
Status: CLOSED ERRATA
QA Contact: Chaitanya <cdommeti>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 8.1
CC: ceph-eng-bugs, cephqe-warriors, idryomov, nladha, sangadi, tserlin
Target Milestone: ---
Target Release: 9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ceph-20.1.0-132
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2416777 (view as bug list)
Environment:
Last Closed: 2026-01-29 07:02:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Aman Agrawal 2025-10-28 13:20:20 UTC
Description of problem:

A few RBD images report up+error ("failed to refresh remote image") due to incomplete group snapshots on the secondary cluster, shortly after workload deployment on an RDR setup. See Actual results below.

Version-Release number of selected component (if applicable):

OADP 1.5.2
ODF 4.20.0-115.stable
ACM 2.15.0-135
MCE 2.10.0-124
OCP 4.20.0-0.nightly-2025-10-07-014413
GitOps 1.16.1
Virtualization 4.20.0-207
Submariner 0.21.0
ceph version 19.2.1-274.el9cp (3a2f1cec313e6abbd90d9260bd5e0e866817c3c7) squid (stable)


How reproducible: Not sure yet; hit for the first time with automation


Steps to Reproduce:
1. Deploy an RBD appset (pull) and a subscription busybox workload on an RDR setup and let IO continue after DR protection. In this case, IO ran for 10-15 minutes before the issue was hit (a status-polling sketch follows this section).

Workloads were deployed via automation.
Console logs- https://url.corp.redhat.com/f616441

Test- tests/functional/disaster-recovery/regional-dr/test_site_failure_recovery_and_failover.py
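
While debugging, mirroring health can be polled from the rook-ceph toolbox pod on either managed cluster while IO runs. A minimal sketch, assuming the default ODF pool name ocs-storagecluster-cephblockpool (not confirmed from the logs here; substitute the actual pool):

# Overall mirroring summary for the pool: daemon, image and group health
rbd mirror pool status ocs-storagecluster-cephblockpool --verbose

# Machine-readable form, convenient for polling from automation
rbd mirror pool status ocs-storagecluster-cephblockpool --format json --pretty-format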


Actual results: 

From hub-

 

echo "////////////////////////////////";date -u; echo "*******";oc get drpc -o wide -A
////////////////////////////////
Tue Oct 28 08:54:00 UTC 2025
*******
NAMESPACE             NAME                       AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE   PROGRESSION   START TIME             DURATION        PEER READY
busybox-workloads-1   busybox-placement-drpc     17h   amagrawa-hr-c1                                      Deployed       Completed     2025-10-27T15:38:18Z   26.331031576s   True
openshift-gitops      busybox-1-placement-drpc   17h   amagrawa-hr-c1                                      Deployed       Completed     2025-10-27T15:42:49Z   16.457398556s   True 
 





From C1-


Last sync time for both VGRs:
oc get vgr -A -oyaml | grep lastSyncTime
    lastSyncTime: "2025-10-28T07:10:01Z"
    lastSyncTime: "2025-10-28T07:10:01Z"


 

mirroringStatus:
      lastChecked: "2025-10-28T08:55:40Z"
      summary:
        daemon_health: OK
        group_health: WARNING
        group_states:
          replaying: 1
          stopping_replay: 1
        health: ERROR
        image_health: ERROR
        image_states:
          error: 2
          replaying: 18
        states:
          error: 2
          replaying: 18
    phase: Ready 
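
The mirroringStatus block above is what Rook surfaces on the CephBlockPool resource; a hedged way to pull just that block, assuming the default ODF namespace and pool name (adjust both if they differ):

oc -n openshift-storage get cephblockpool ocs-storagecluster-cephblockpool \
  -o jsonpath='{.status.mirroringStatus}'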
 

 

csi-vol-6896405b-b1c5-45eb-bc52-81a1791f5382
csi-vol-6896405b-b1c5-45eb-bc52-81a1791f5382:
  global_id:   619124f6-703a-4a53-863a-5b15cd252c55
  state:       up+stopped
  description: local image is primary
  service:     a on compute-2
  last_update: 2025-10-28 09:23:16
  peer_sites:
    name: 62b4febe-7ab7-4aed-9a88-3f1fd167d3a9
    state: up+error
    description: failed to refresh remote image
    last_update: 2025-10-28 09:22:56



csi-vol-68569653-519d-45c9-b4e0-5c70e94463c6
csi-vol-68569653-519d-45c9-b4e0-5c70e94463c6:
  global_id:   3aecb3f7-5823-4a00-bdad-fef5e4795363
  state:       up+stopped
  description: local image is primary
  service:     a on compute-2
  last_update: 2025-10-28 09:23:16
  peer_sites:
    name: 62b4febe-7ab7-4aed-9a88-3f1fd167d3a9
    state: up+error
    description: failed to refresh remote image
    last_update: 2025-10-28 09:22:56 
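
The two per-image blocks above match the shape of rbd mirror image status output; to re-query one of these images directly from the toolbox (pool name assumed as before, image name copied from above):

rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-6896405b-b1c5-45eb-bc52-81a1791f5382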
 


From C2-

 

 mirroringStatus:
      lastChecked: "2025-10-28T08:57:23Z"
      summary:
        daemon_health: OK
        group_health: OK
        group_states:
          replaying: 2
        health: ERROR
        image_health: ERROR
        image_states:
          error: 2
          replaying: 18
        states:
          error: 2
          replaying: 18
    phase: Ready



csi-vol-6896405b-b1c5-45eb-bc52-81a1791f5382
csi-vol-6896405b-b1c5-45eb-bc52-81a1791f5382:
  global_id:   619124f6-703a-4a53-863a-5b15cd252c55
  state:       up+error
  description: failed to refresh remote image
  service:     a on compute-1
  last_update: 2025-10-28 07:15:56
  peer_sites:
    name: e40ca6e1-43bb-4426-aa99-aa9f3d1a7d1b
    state: up+stopped
    description: local image is primary
    last_update: 2025-10-28 09:22:41


csi-vol-68569653-519d-45c9-b4e0-5c70e94463c6
csi-vol-68569653-519d-45c9-b4e0-5c70e94463c6:
  global_id:   3aecb3f7-5823-4a00-bdad-fef5e4795363
  state:       up+error
  description: failed to refresh remote image
  service:     a on compute-1
  last_update: 2025-10-28 07:15:56
  peer_sites:
    name: e40ca6e1-43bb-4426-aa99-aa9f3d1a7d1b
    state: up+stopped
    description: local image is primary
    last_update: 2025-10-28 09:22:16 
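
Since the suspected cause is incomplete group snapshots on the secondary, the group's snapshot list can be checked there directly. A sketch, with the pool name assumed as above and <group-name> standing in for the actual mirror group:

# Discover RBD groups in the pool
rbd group ls ocs-storagecluster-cephblockpool

# List the group's snapshots; on recent builds the JSON output carries a
# per-snapshot state, which should distinguish complete from incomplete snapshots
rbd group snap ls ocs-storagecluster-cephblockpool/<group-name> --format json --pretty-format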


Expected results: RBD images shouldn't report errors due to incomplete group snapshots on the secondary cluster.


Additional info: Relevant thread- https://ibm-systems-storage.slack.com/archives/C06GQDKEVGT/p1761641896967529

No node reboot or DR operation was performed. This was a fresh setup.

Overall health of both clusters was okay.

ODF tracker bug- DFBUGS-4392

Comment 1 Storage PM bot 2025-10-28 13:20:33 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 14 errata-xmlrpc 2026-01-29 07:02:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536