Bug 2033455

Summary: [RDR] OSD Blocklist entries added during failover and fallback operations prevent rbd-mirror communication
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation Reporter: Jean-Charles Lopez <jelopez>
Component: ceph Assignee: Ilya Dryomov <idryomov>
ceph sub component: RBD-Mirror QA Contact: Elad <ebenahar>
Status: ASSIGNED --- Docs Contact:
Severity: high    
Priority: unspecified CC: aclewett, amagrawa, bniver, ebenahar, kramdoss, mmuench, muagarwa, odf-bz-bot, pnataraj, prsurve, sostapov, srangana
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jean-Charles Lopez 2021-12-16 21:26:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
- Deployed a test application on Cluster 1 via ACM and Ramen
- Failed over to Cluster 2
- Relocated application on Cluster 1
- Failed over to Cluster 2
- Let the application run for the entire night
- Relocated application on Cluster
- Deleted application via ACM and Ramen
- Deployed a test application on Cluster 1 via ACM and Ramen

On the new deployment, the RBD images are created on Cluster 1 but are not mirrored to Cluster 2.

RBD Mirror reports daemon_health OK, but the images are in error or unknown status.

Version of all relevant components (if applicable):
OCP 4.9
ODF 4.9.1 Build 252


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
Yes
Remove the OSD block list entries from both clusters
Restart the rbd-mirror pod on each cluster
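
The first workaround step could be scripted roughly as follows. This is a minimal sketch, not the exact procedure used for this report: it assumes that `ceph osd blocklist ls` prints one `<address> <expiry>` entry per line (plus a "listed N entries" summary), and it only builds the `ceph osd blocklist rm` commands for review instead of running them. The helper name `blocklist_rm_commands` and the sample addresses are hypothetical.

```python
from __future__ import annotations


def blocklist_rm_commands(ls_output: str) -> list[str]:
    """Build one `ceph osd blocklist rm` command per listed entry."""
    commands = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "listed N entries" summary line.
        if not fields or fields[0] == "listed":
            continue
        # The first field is assumed to be the blocklisted client address.
        commands.append(f"ceph osd blocklist rm {fields[0]}")
    return commands


# Hypothetical sample of `ceph osd blocklist ls` output (not from this bug).
sample = """\
10.0.0.1:0/1234567890 2021-12-17T17:11:47.801000+0000
10.0.0.2:6800/987654321 2021-12-17T17:12:03.114000+0000
listed 2 entries
"""

for command in blocklist_rm_commands(sample):
    print(command)
```

The generated commands would need to be run on both clusters, and the rbd-mirror pod on each cluster still has to be restarted afterwards.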

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
5

Is this issue reproducible?
Unsure at this point

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:
Unsure at this point

Steps to Reproduce:
1.
2.
3.


Actual results:
rbd-mirror reports errors
RBD images are created on cluster 1
RBD images are NOT created on cluster 2


Expected results:
rbd-mirror reports health OK
RBD images are created on cluster 1
RBD images are created on cluster 2

Additional info:
We identified the problem through the rbd-mirror log:
debug 2021-12-16T17:11:47.801+0000 7f4fc10b4700 -1 rbd::mirror::InstanceReplayer: 0x55f75227a140 start_image_replayer: global_image_id=c1cf27bc-4046-4269-a4e3-52404211f945: blocklisted detected during image replay

The error shows in both clusters
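
To check a full rbd-mirror log for the same symptom, something like the following could be used. It is a rough sketch that assumes every affected image produces a line in the same format as the one quoted above; the function name `blocklisted_image_ids` is hypothetical.

```python
import re

# Matches the "blocklisted detected" message format seen in the log line above.
BLOCKLIST_RE = re.compile(
    r"global_image_id=(?P<gid>[0-9a-f-]+): blocklisted detected"
)


def blocklisted_image_ids(log_text):
    """Return the distinct global image IDs reported as blocklisted, in order."""
    ids = []
    for match in BLOCKLIST_RE.finditer(log_text):
        gid = match.group("gid")
        if gid not in ids:
            ids.append(gid)
    return ids


log_line = (
    "debug 2021-12-16T17:11:47.801+0000 7f4fc10b4700 -1 "
    "rbd::mirror::InstanceReplayer: 0x55f75227a140 start_image_replayer: "
    "global_image_id=c1cf27bc-4046-4269-a4e3-52404211f945: "
    "blocklisted detected during image replay"
)

print(blocklisted_image_ids(log_line))
```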

The RBD image status shows the following:
sh-4.4$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-5e5431d9-5e90-11ec-ad05-0a580a83001c
csi-vol-5e5431d9-5e90-11ec-ad05-0a580a83001c:
  global_id:   599661e3-4255-40c3-8b9a-50be309b7cd0
  state:       up+stopped
  description: local image is primary
  last_update: 2021-12-16 17:09:09
  peer_sites:
    name: 93712e2c-0253-4dae-914e-6418b0df74bb
    state: down+unknown
    description: status not found
    last_update: 
  snapshots:
    3380 .mirror.primary.599661e3-4255-40c3-8b9a-50be309b7cd0.e6856604-763a-4072-9316-e16e9df07cc9 (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
    3395 .mirror.primary.599661e3-4255-40c3-8b9a-50be309b7cd0.03e0ee1e-0116-43f1-880c-02de74518869 (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
    3417 .mirror.primary.599661e3-4255-40c3-8b9a-50be309b7cd0.4fba24c5-8963-4d02-a186-82c7590c8067 (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
sh-4.4$ rbd mirror image status ocs-storagecluster-cephblockpool/csi-vol-5e5ba8b6-5e90-11ec-ad05-0a580a83001c
csi-vol-5e5ba8b6-5e90-11ec-ad05-0a580a83001c:
  global_id:   0f2ca8df-4d22-4261-aec7-fa5705d11f0d
  state:       up+stopped
  description: local image is primary
  service:     a on ip-10-0-198-218.us-east-2.compute.internal
  last_update: 2021-12-16 17:09:11
  peer_sites:
    name: 93712e2c-0253-4dae-914e-6418b0df74bb
    state: down+unknown
    description: status not found
    last_update: 
  snapshots:
    3385 .mirror.primary.0f2ca8df-4d22-4261-aec7-fa5705d11f0d.2aad366c-ba10-41d1-bafe-5c1c86185b59 (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
    3393 .mirror.primary.0f2ca8df-4d22-4261-aec7-fa5705d11f0d.8c2c90ed-dd39-4820-905b-f5c67f240923 (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
    3418 .mirror.primary.0f2ca8df-4d22-4261-aec7-fa5705d11f0d.f5f6011c-a43c-4a84-afe8-ca40383d445a (peer_uuids:[7d3d7527-9ed2-49e4-8b9a-6fa791c8ae84])
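
With many images, the peers reported as down can be picked out of the plain-text status output with a small parser. This is only a sketch that assumes the layout shown above (an unindented `<image>:` header, then a `peer_sites:` section followed by `snapshots:`); the function name `down_peers` is hypothetical and the sample is an abbreviated copy of the first image's status.

```python
def down_peers(status_text):
    """Return (image, peer_name, peer_state) for every peer whose state starts with 'down'."""
    results = []
    image = None
    peer = None
    in_peers = False
    for raw in status_text.splitlines():
        line = raw.strip()
        if raw and not raw[0].isspace() and line.endswith(":"):
            # Unindented "<image>:" line starts a new image block.
            image = line[:-1]
            in_peers = False
        elif line == "peer_sites:":
            in_peers = True
        elif line == "snapshots:":
            in_peers = False
        elif in_peers and line.startswith("name:"):
            peer = line.split(":", 1)[1].strip()
        elif in_peers and line.startswith("state:"):
            state = line.split(":", 1)[1].strip()
            if state.startswith("down"):
                results.append((image, peer, state))
    return results


sample_status = """\
csi-vol-5e5431d9-5e90-11ec-ad05-0a580a83001c:
  global_id:   599661e3-4255-40c3-8b9a-50be309b7cd0
  state:       up+stopped
  description: local image is primary
  last_update: 2021-12-16 17:09:09
  peer_sites:
    name: 93712e2c-0253-4dae-914e-6418b0df74bb
    state: down+unknown
    description: status not found
    last_update:
  snapshots:
"""

for image, peer, state in down_peers(sample_status):
    print(f"{image}: peer {peer} is {state}")
```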

Comment 7 Mudit Agarwal 2022-06-29 13:34:19 UTC
Not a TP blocker, moving it out of 4.11

Comment 24 Mudit Agarwal 2023-04-06 12:44:31 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=2034283 is moved to 4.14