Bug 2215982 - replication not happening and "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" do not return result
Status: CLOSED DUPLICATE of bug 2067095
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Ram Raja
QA Contact: Elad
Duplicates: 2221716
 
Reported: 2023-06-19 14:58 UTC by Elvir Kuric
Modified: 2023-08-09 16:37 UTC (History)
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-11 12:55:43 UTC
Embargoed:



Description Elvir Kuric 2023-06-19 14:58:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

With ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (a test image we use for snaptrim BZ verification, bz2119217), we see that

"rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" fail.

Error message: 

$ rbd mirror snapshot schedule ls -R  
rbd: rbd mirror snapshot schedule list failed: (11) Resource temporarily unavailable
$ rbd mirror snapshot schedule status
rbd: rbd mirror snapshot schedule status failed: (11) Resource temporarily unavailable
rbd: invalid schedule status JSON received

At the same time, on the same cluster:

$   rbd -p ocs-storagecluster-cephblockpool mirror pool status
health: OK
daemon health: OK
image health: OK
images: 100 total
    100 replaying
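
For reference, a minimal diagnostic sketch (run from the toolbox pod) that could help narrow down why the schedule commands return EAGAIN. These are standard Ceph commands added here only as a suggestion, not a confirmed workaround for this bug:

# From the toolbox pod (oc rsh -n openshift-storage $TOOLS_POD):

# Is the rbd_support mgr module listed/enabled?
ceph mgr module ls | grep rbd_support

# Any blocklisted clients? A blocklisted rbd_support module client is one
# possible cause of EAGAIN from the "rbd mirror snapshot schedule" commands.
ceph osd blocklist ls

# Overall cluster and mgr state for context
ceph status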



Version of all relevant components (if applicable):


 oc rsh -n openshift-storage $TOOLS_POD
sh-5.1$ ceph versions
{
    "mon": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 21
    },
    "mds": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "rgw": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 29
    }
}

OCP version:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.5   True     


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes



Is there any workaround available to the best of your knowledge?
NA


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, I hit it twice (out of two runs) during my 10-hour test runs.

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

NA
Steps to Reproduce:
1. Create an ODF DR setup with the above ceph version (or upgrade to it).
2. Create 100 pods with one PVC per pod and write 10 GB per pod with fio randrw (70% write / 30% read) and --runtime=36000 (10 hours); see the fio sketch after this list.
3. Leave the test running for 10-15 hours; after some time, "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" will stop working.
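
A minimal fio sketch matching the workload in step 2. Only the randrw mix, 10 GB size, and 36000 s runtime come from the steps above; the job name, target directory (PVC mount path), block size, and ioengine/direct settings are illustrative assumptions:

# Run inside each pod against its PVC mount point (path is a placeholder)
fio --name=dr-load \
    --directory=/mnt/pvc \
    --rw=randrw --rwmixwrite=70 \
    --size=10G \
    --bs=4k \
    --ioengine=libaio --direct=1 \
    --runtime=36000 --time_based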


Actual results:
The commands below fail:
"rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status"

Must-gather from cluster1/cluster2 (OCP / ODF):

http://perf148b.perf.lab.eng.bos.redhat.com/bz/bz-snapshot-schedule-not/
This cluster has "ceph config set mgr mgr/rbd_support/log_level debug" enabled.
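
A short sketch for pulling the rbd_support debug output referenced above; the rook-ceph-mgr pod label is an assumption based on a standard Rook/ODF deployment:

# Confirm the debug level is applied (from the toolbox pod)
ceph config get mgr mgr/rbd_support/log_level

# Grab the mgr log from the Rook mgr pod and filter for rbd_support entries
MGR_POD=$(oc -n openshift-storage get pods -l app=rook-ceph-mgr -o name | head -n 1)
oc -n openshift-storage logs "$MGR_POD" | grep -i rbd_support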


It seems that replication is not happening while the issue is present; compare the "ceph df" outputs from cluster1 and cluster2 below.

cluster1: 

 ceph df 
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  11 TiB  6.9 TiB   6.9 TiB      37.58
TOTAL  18 TiB  11 TiB  6.9 TiB   6.9 TiB      37.58
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   38 MiB       11  113 MiB      0    1.9 TiB
ocs-storagecluster-cephblockpool                        2  512  4.9 TiB    1.35M  6.8 TiB  53.85    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              3    8      0 B        0      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.control          4    8      0 B        8      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    5    8  8.0 KiB       11   24 KiB      0    1.9 TiB
.rgw.root                                               6    8  5.7 KiB       16  180 KiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.log              8    8  1.2 MiB      340  5.4 MiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             9    8  7.8 KiB       14  136 KiB      0    1.9 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16   15 MiB       27   45 MiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11   32    1 KiB        1   12 KiB      0    1.9 TiB
ocs-storagecluster-cephfilesystem-data0                12   32      0 B        0      0 B      0    1.9 TiB


cluster2: 

ceph df
 --- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  15 TiB  3.4 TiB   3.4 TiB      18.72
TOTAL  18 TiB  15 TiB  3.4 TiB   3.4 TiB      18.72
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   40 MiB       11  120 MiB      0    4.0 TiB
ocs-storagecluster-cephblockpool                        2  512  2.1 TiB  548.12k  3.4 TiB  22.10    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.control          3    8      0 B        8      0 B      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    4    8  9.6 KiB       11   29 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.log              5    8  1.3 MiB      340  5.7 MiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             6    8   10 KiB       14  144 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    4.0 TiB
.rgw.root                                               8    8  5.7 KiB       16  180 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              9    8      0 B        0      0 B      0    4.0 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16   33 KiB       22  189 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11  128    1 KiB        1   12 KiB      0    4.0 TiB
ocs-storagecluster-cephfilesystem-data0                12  128      0 B        0      0 B      0    4.0 TiB
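
To cross-check replication from the RBD side as well (the pool-level status earlier reports all 100 images as "replaying"), a per-image status sketch; the image name is a placeholder:

# Per-image mirror status for the whole pool
rbd -p ocs-storagecluster-cephblockpool mirror pool status --verbose

# Or for a single image (image name is a placeholder)
rbd -p ocs-storagecluster-cephblockpool mirror image status <image-name>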

Comment 4 Ilya Dryomov 2023-07-11 16:58:13 UTC
*** Bug 2221716 has been marked as a duplicate of this bug. ***

