Bug 2135372 - [RDR] RBD promote image is failing with msg can_create_primary_snapshot: cannot rollback rbd: error promoting image to primary
Summary: [RDR] RBD promote image is failing with msg can_create_primary_snapshot: cann...
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Ilya Dryomov
QA Contact: Elad
URL:
Whiteboard:
Depends On: 2080982
Blocks:
 
Reported: 2022-10-17 12:14 UTC by Pratik Surve
Modified: 2023-08-22 11:59 UTC (History)
5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:



Description Pratik Surve 2022-10-17 12:14:27 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
[RDR] RBD promote image is failing with msg can_create_primary_snapshot: cannot rollback rbd: error promoting image to primary

Version of all relevant components (if applicable):

OCP version:- 4.11.8
ODF version:- 4.11.2-5
CEPH version:- ceph version 16.2.8-84.el8cp (c2980f2fd700e979d41b4bad2939bb90f0fe435c) pacific (stable)


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy an RDR cluster
2. Deploy an RBD workload
3. Perform failover and check pod status


Actual results:

Message:                 failed to promote image "ocs-storagecluster-cephblockpool/csi-vol-816dceee-4a4d-11ed-8778-0a580a80022a" with error: an error (exit status 22) and stderror (2022-10-17T08:19:35.147+0000 7ff534e4c700 -1 librbd::mirror::snapshot::util:  can_create_primary_snapshot: cannot rollback
rbd: error promoting image to primary
2022-10-17T08:19:35.147+0000 7ff534e4c700 -1 librbd::mirror::snapshot::PromoteRequest: 0x7ff51c0044e0 send: cannot promote
2022-10-17T08:19:35.147+0000 7ff534e4c700 -1 librbd::mirror::PromoteRequest: 0x55d6efe93410 handle_promote: failed to promote image: (22) Invalid argument
2022-10-17T08:19:35.147+0000 7ff547814380 -1 librbd::api::Mirror: image_promote: failed to promote image
) occurred while running rbd args: [mirror image promote ocs-storagecluster-cephblockpool/csi-vol-816dceee-4a4d-11ed-8778-0a580a80022a --force --id csi-rbd-provisioner -m 172.30.103.130:6789,172.30.71.74:6789,172.30.200.93:6789 --keyfile=***stripped***]



Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  35m (x984 over 2d)     kubelet  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc kube-api-access-jltx8]: timed out waiting for the condition
  Warning  FailedMount  20m (x284 over 2d)     kubelet  Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[kube-api-access-jltx8 mypvc]: timed out waiting for the condition
  Warning  FailedMount  5m41s (x1432 over 2d)  kubelet  MountVolume.MountDevice failed for volume "pvc-daf5aba0-070d-48d8-962b-1e0796d435fe" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 6) occurred while running rbd args: [--id csi-rbd-node -m 172.30.103.130:6789,172.30.71.74:6789,172.30.200.93:6789 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-816d7915-4a4d-11ed-8778-0a580a80022a --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (6) No such device or address
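For reference, the exit statuses in the two failures above correspond to standard errno values: 22 is EINVAL (matching the librbd "(22) Invalid argument" line from the promote), and 6 is ENXIO (matching the "No such device or address" from the krbd map). A quick way to confirm the mapping on a Linux host (this decoding step is purely illustrative, not part of the reproducer):

```shell
# Decode the rbd exit statuses seen above as errno names/strings.
# Exit status 22 -> promote failure, exit status 6 -> krbd map failure.
python3 -c 'import errno, os
for e in (22, 6):
    print(e, errno.errorcode[e], "-", os.strerror(e))'
# On Linux this prints:
#   22 EINVAL - Invalid argument
#   6 ENXIO - No such device or address
```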


oc get pods       
NAME                       READY   STATUS              RESTARTS   AGE
dd-io-1-7849869c6-2275p    0/1     ContainerCreating   0          2d
dd-io-2-5589c487cd-5zx2q   0/1     ContainerCreating   0          2d
dd-io-3-578757c5d4-sn9xj   0/1     ContainerCreating   0          2d
dd-io-4-6796878474-vsp5j   0/1     ContainerCreating   0          2d
dd-io-5-9b774b9b8-xzpdj    0/1     ContainerCreating   0          2d
dd-io-6-754fd4bcdb-p5tk5   0/1     ContainerCreating   0          2d
dd-io-7-6f88d765b8-c48pp   0/1     ContainerCreating   0          2d


oc get vr                              
NAME          AGE   VOLUMEREPLICATIONCLASS                 PVCNAME       DESIREDSTATE   CURRENTSTATE
dd-io-pvc-1   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-1   primary        Unknown
dd-io-pvc-2   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-2   primary        Unknown
dd-io-pvc-3   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-3   primary        Unknown
dd-io-pvc-4   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-4   primary        Unknown
dd-io-pvc-5   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-5   primary        Unknown
dd-io-pvc-6   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-6   primary        Unknown
dd-io-pvc-7   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-7   primary        Unknown
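With many PVCs, the CURRENTSTATE column above can be filtered mechanically. A minimal sketch, using sample rows copied from this report in place of a live `oc get vr` call (the file name and the heredoc are illustrative only; on a cluster you would pipe `oc get vr` directly into the awk filter):

```shell
# Filter VolumeReplication resources whose CURRENTSTATE is Unknown.
# On a live cluster:  oc get vr | awk 'NR>1 && $NF=="Unknown" {print $1}'
# Sample rows below are taken from the report itself.
cat <<'EOF' > vr.txt
NAME          AGE   VOLUMEREPLICATIONCLASS                 PVCNAME       DESIREDSTATE   CURRENTSTATE
dd-io-pvc-1   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-1   primary        Unknown
dd-io-pvc-2   2d    rbd-volumereplicationclass-473128587   dd-io-pvc-2   primary        Unknown
EOF
# Skip the header row (NR>1) and print names where the last field is Unknown.
awk 'NR>1 && $NF=="Unknown" {print $1}' vr.txt
```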


Expected results:
Failover should work and the promote command should not fail.

Additional info:

Comment 4 Elad 2022-10-26 14:10:48 UTC
Since this seems to be a quite straightforward scenario, proposing as a blocker.

