Bug 2215982

Summary: replication not happening, and "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" do not return results
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: ceph
ceph sub component: RBD-Mirror
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: Elvir Kuric <ekuric>
Assignee: Ram Raja <rraja>
QA Contact: Elad <ebenahar>
CC: bniver, idryomov, kmanohar, muagarwa, ocs-bugs, odf-bz-bot, sostapov
Target Milestone: ---
Target Release: ---
Type: Bug
Regression: ---
Last Closed: 2023-07-11 12:55:43 UTC

Description Elvir Kuric 2023-06-19 14:58:35 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

With ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (a test image used for snaptrim verification of bz2119217), we see that

"rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" fail.

Error messages:

$ rbd mirror snapshot schedule ls -R  
rbd: rbd mirror snapshot schedule list failed: (11) Resource temporarily unavailable
$ rbd mirror snapshot schedule status
rbd: rbd mirror snapshot schedule status failed: (11) Resource temporarily unavailable
rbd: invalid schedule status JSON received
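
For context, the "rbd mirror snapshot schedule" commands are served by the rbd_support module inside ceph-mgr, so the EAGAIN above points at the mgr side rather than the rbd-mirror daemon. A minimal diagnostic sketch, run from the toolbox pod (the grep/tail filtering is only illustrative):

# confirm the rbd_support mgr module is present (it is an always-on module in quincy)
$ ceph mgr module ls | grep rbd_support

# retry with client-side rbd debug to capture the request that comes back with EAGAIN
$ rbd mirror snapshot schedule status --debug-rbd=20 2>&1 | tail -n 40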

At the same time, on the same cluster:

$   rbd -p ocs-storagecluster-cephblockpool mirror pool status
health: OK
daemon health: OK
image health: OK
images: 100 total
    100 replaying
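
Because the pool-level status still reports OK, per-image checks are what show whether new mirror snapshots are actually being taken. A minimal sketch, where <image> is a placeholder for one of the 100 mirrored images:

# per-image mirroring state as seen by rbd-mirror
$ rbd -p ocs-storagecluster-cephblockpool mirror image status <image>

# list all snapshots, including mirror snapshots; the timestamps show whether new ones keep appearing
$ rbd snap ls --all ocs-storagecluster-cephblockpool/<image>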



Version of all relevant components (if applicable):


 oc rsh -n openshift-storage $TOOLS_POD
sh-5.1$ ceph versions
{
    "mon": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "osd": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 21
    },
    "mds": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 2
    },
    "rbd-mirror": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "rgw": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 1
    },
    "overall": {
        "ceph version 17.2.6-70.0.TEST.bz2119217.el9cp (6d74fefa15d1216867d1d112b47bb83c4913d28f) quincy (stable)": 29
    }
}

OCP version:

$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-rc.5   True     


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes



Is there any workaround available to the best of your knowledge?
NA


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
Yes, hit it twice (out of two runs) in my 10-hour test runs.

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

NA
Steps to Reproduce:
1. Create an ODF DR setup with the above ceph version (or upgrade to it).
2. Create 100 pods with one PVC per pod and write 10 GB per pod (fio randrw, 70% write / 30% read, --runtime=36000, i.e. 10 hours); see the fio sketch after this list.
3. Leave the test running for 10-15 hours; after some time, "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status" stop working.
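
A minimal sketch of the per-pod fio job from step 2 (the job name and mount path are illustrative; each pod runs one such job against its own PVC):

# 10 GiB working set per pod, 70% writes / 30% reads, time based for 10 hours
# /mnt/pvc is an assumed mount path of the pod's PVC
fio --name=drload --directory=/mnt/pvc \
    --rw=randrw --rwmixwrite=70 \
    --size=10G --bs=4k --direct=1 \
    --time_based --runtime=36000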


Actual results:
The queries below fail:
"rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot schedule status"

Must-gather from cluster1/cluster2 (OCP/ODF):

http://perf148b.perf.lab.eng.bos.redhat.com/bz/bz-snapshot-schedule-not/
This cluster has "ceph config set mgr mgr/rbd_support/log_level debug" enabled.
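
With that log level set, the rbd_support scheduling activity should be visible in the active mgr's log. A sketch of pulling it on an ODF cluster (the rook-ceph-mgr label and the "mgr" container name are assumptions about the Rook deployment):

# raise rbd_support verbosity (already done on this cluster), then tail the mgr log
$ ceph config set mgr mgr/rbd_support/log_level debug
$ oc -n openshift-storage logs -l app=rook-ceph-mgr -c mgr --tail=200 | grep rbd_support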


It seems that replication is not happening while the issue is present; compare the "ceph df" outputs below from cluster1 and cluster2 (a scripted comparison is sketched after the outputs).

cluster1: 

 ceph df 
--- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  11 TiB  6.9 TiB   6.9 TiB      37.58
TOTAL  18 TiB  11 TiB  6.9 TiB   6.9 TiB      37.58
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   38 MiB       11  113 MiB      0    1.9 TiB
ocs-storagecluster-cephblockpool                        2  512  4.9 TiB    1.35M  6.8 TiB  53.85    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              3    8      0 B        0      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.control          4    8      0 B        8      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    5    8  8.0 KiB       11   24 KiB      0    1.9 TiB
.rgw.root                                               6    8  5.7 KiB       16  180 KiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.log              8    8  1.2 MiB      340  5.4 MiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             9    8  7.8 KiB       14  136 KiB      0    1.9 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16   15 MiB       27   45 MiB      0    1.9 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11   32    1 KiB        1   12 KiB      0    1.9 TiB
ocs-storagecluster-cephfilesystem-data0                12   32      0 B        0      0 B      0    1.9 TiB


cluster2: 

ceph df
 --- RAW STORAGE ---
CLASS    SIZE   AVAIL     USED  RAW USED  %RAW USED
ssd    18 TiB  15 TiB  3.4 TiB   3.4 TiB      18.72
TOTAL  18 TiB  15 TiB  3.4 TiB   3.4 TiB      18.72
 
--- POOLS ---
POOL                                                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                                                    1    1   40 MiB       11  120 MiB      0    4.0 TiB
ocs-storagecluster-cephblockpool                        2  512  2.1 TiB  548.12k  3.4 TiB  22.10    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.control          3    8      0 B        8      0 B      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.index    4    8  9.6 KiB       11   29 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.log              5    8  1.3 MiB      340  5.7 MiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.meta             6    8   10 KiB       14  144 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec   7    8      0 B        0      0 B      0    4.0 TiB
.rgw.root                                               8    8  5.7 KiB       16  180 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.otp              9    8      0 B        0      0 B      0    4.0 TiB
ocs-storagecluster-cephfilesystem-metadata             10   16   33 KiB       22  189 KiB      0    4.0 TiB
ocs-storagecluster-cephobjectstore.rgw.buckets.data    11  128    1 KiB        1   12 KiB      0    4.0 TiB
ocs-storagecluster-cephfilesystem-data0                12  128      0 B        0      0 B      0    4.0 TiB
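
The gap above (4.9 TiB vs 2.1 TiB stored in ocs-storagecluster-cephblockpool) is what suggests replication has stalled. A minimal sketch for tracking that number on each cluster over time (assumes jq is available where the toolbox commands are run; field names follow the quincy "ceph df" JSON output):

# run on each cluster and compare the reported byte counts over time
$ ceph df --format json | jq '.pools[] | select(.name == "ocs-storagecluster-cephblockpool") | .stats.stored'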

Comment 4 Ilya Dryomov 2023-07-11 16:58:13 UTC
*** Bug 2221716 has been marked as a duplicate of this bug. ***