Bug 2305951 - [CephFS - Consistency Group] - Quiesce timed out after 10 mins on a member with rc 110 while IO in progress
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: CephFS
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 8.0
Assignee: Patrick Donnelly
QA Contact: sumr
URL:
Whiteboard:
Depends On:
Blocks: 2317218
 
Reported: 2024-08-20 06:21 UTC by sumr
Modified: 2024-11-25 09:06 UTC
CC: 9 users

Fixed In Version: ceph-19.1.1-8.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2024-11-25 09:06:19 UTC
Embargoed:




Links
Red Hat Issue Tracker RHCEPH-9547 (last updated 2024-08-23 15:52:10 UTC)
Red Hat Product Errata RHBA-2024:10216 (last updated 2024-11-25 09:06:24 UTC)

Description sumr 2024-08-20 06:21:23 UTC
Description of problem:
CG quiesce on a quiesce set of 10 subvolumes from a non-default group timed out after 10 mins, with a member quiesce timeout on sv_non_def_1 and the await completing with rc 110.

Cmd: 
2024-08-17 23:57:50,495 (cephci.snapshot_clone.cg_snap_system_test) [INFO] - cephci.RH.8.0.rhel-9.Weekly.19.1.0-22.cephfs.4.cephci.ceph.ceph.py:1596 - Running command ceph fs quiesce cephfs  "subvolgroup_cg/sv_non_def_1"  "subvolgroup_cg/sv_non_def_2"  "subvolgroup_cg/sv_non_def_3"  "subvolgroup_cg/sv_non_def_4"  "subvolgroup_cg/sv_non_def_5"  "subvolgroup_cg/sv_non_def_6"  "subvolgroup_cg/sv_non_def_7"  "subvolgroup_cg/sv_non_def_8"  "subvolgroup_cg/sv_non_def_9"  "subvolgroup_cg/sv_non_def_10"  --format json  --set-id cg_scale_f6fc  --await  --timeout 600  --expiration 600 on 10.0.195.112 timeout 600

Response:
  File "/home/jenkins/ceph-builds/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/cephci/tests/cephfs/snapshot_clone/cg_snap_system_test.py", line 368, in cg_scale
    cg_snap_util.cg_quiesce(
  File "/home/jenkins/ceph-builds/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/cephci/tests/cephfs/snapshot_clone/cg_snap_utils.py", line 150, in cg_quiesce
    out, rc = client.exec_command(
  File "/home/jenkins/ceph-builds/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/cephci/ceph/ceph.py", line 2226, in exec_command
    return self.node.exec_command(cmd=cmd, **kw)
  File "/home/jenkins/ceph-builds/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/cephci/ceph/ceph.py", line 1619, in exec_command
    raise SocketTimeoutException(sock_err)
ceph.ceph.SocketTimeoutException
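
For context, rc 110 is errno ETIMEDOUT on Linux: the quiesce-db leader completed the --await because at least one member did not quiesce within the 600s window. Note that the harness also ran the command under a transport timeout equal to the quiesce --timeout (both 600s), so the SSH exec gave up at roughly the same moment the MDS await expired, which is why the failure surfaces as SocketTimeoutException instead of the command's own return code. A minimal sketch of a wrapper that keeps headroom between the two timeouts (cg_quiesce here is a hypothetical stand-in for the cephci cg_snap_utils helper, and it assumes the await rc propagates as the CLI exit status):

# Hypothetical wrapper, not the cephci code: run a CG quiesce with --await
# and map the ETIMEDOUT return code. Assumes `ceph` is on PATH and that the
# await rc (110 in this bug) is returned as the CLI exit status.
import errno
import json
import subprocess

def cg_quiesce(vol, members, set_id, timeout=600, expiration=600):
    cmd = ["ceph", "fs", "quiesce", vol, *members,
           "--set-id", set_id, "--await",
           "--timeout", str(timeout), "--expiration", str(expiration),
           "--format", "json"]
    # Give the local call headroom beyond the quiesce timeout so the MDS
    # can report its own rc before the transport gives up.
    proc = subprocess.run(cmd, capture_output=True, text=True,
                          timeout=timeout + 60)
    if proc.returncode == errno.ETIMEDOUT:  # 110
        raise TimeoutError(f"quiesce set {set_id} timed out: {proc.stderr}")
    proc.check_returncode()
    return json.loads(proc.stdout)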

MDS Debug log snippet:
ceph-mds.cephfs.ceph-weekly-0u6921-lp59i1-node2.jjzygd.log:2024-08-18T03:57:09.618+0000 7f2f33e37640  1 mds.cephfs.ceph-weekly-0u6921-lp59i1-node2.jjzygd asok_command: quiesce db {await=1,expiration=600,format=json,members=[subvolgroup_cg/sv_non_def_1,subvolgroup_cg/sv_non_def_2,subvolgroup_cg/sv_non_def_3,subvolgroup_cg/sv_non_def_4,subvolgroup_cg/sv_non_def_5,subvolgroup_cg/sv_non_def_6,subvolgroup_cg/sv_non_def_7,subvolgroup_cg/sv_non_def_8,subvolgroup_cg/sv_non_def_9,subvolgroup_cg/sv_non_def_10],prefix=quiesce db,roots=[/volumes/subvolgroup_cg/sv_non_def_1/54e6c36f-e555-49da-ab07-2475dc235f0c,/volumes/subvolgroup_cg/sv_non_def_2/c0e146ce-f5d6-433b-8d1c-d3b34073aa4d,/volumes/subvolgroup_cg/sv_non_def_3/f0c14e16-3719-450b-8d63-91916ddf2381,/volumes/subvolgroup_cg/sv_non_def_4/6adbb35d-7a7d-49e1-9b2c-5f73e30ddfc1,/volumes/subvolgroup_cg/sv_non_def_5/4ec557ae-98d2-49b1-bd1a-50dd513c8a94,/volumes/subvolgroup_cg/sv_non_def_6/3c468a22-358b-4c4a-99d6-e021bfc16792,/volumes/subvolgroup_cg/sv_non_def_7/d359a401-a70b-4ecc-942c-58c24f6a11ec,/volumes/subvolgroup_cg/sv_non_def_8/f19885c8-5601-442c-b428-7f67daf81904,/volumes/subvolgroup_cg/sv_non_def_9/d9b65656-a866-47c1-9f38-d5113ba07b79,/volumes/subvolgroup_cg/sv_non_def_10/f75f73dc-dfb6-4cc1-a85c-e7ddf8ab7050],set_id=cg_scale_f6fc,target=[mon-mgr,],timeout=600,vol_name=cephfs} (starting...)

ceph-mds.cephfs.ceph-weekly-0u6921-lp59i1-node2.jjzygd.log:2024-08-18T04:07:09.623+0000 7f2f28e21640 10 quiesce.mgr.24284 <leader_upkeep_set> [cg_scale_f6fc@10,file:/volumes/subvolgroup_cg/sv_non_def_1/54e6c36f-e555-49da-ab07-2475dc235f0c] detected a member quiesce timeout
ceph-mds.cephfs.ceph-weekly-0u6921-lp59i1-node2.jjzygd.log:2024-08-18T04:07:09.623+0000 7f2f28e21640 10 quiesce.mgr.24284 <leader_upkeep_awaits> completing an await for the set 'cg_scale_f6fc' with rc: 110
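
When an await completes with rc 110, the set transitions to the TIMEDOUT state, and the member that held it up can be identified by querying the set by id. A sketch, assuming the --query operation of ceph fs quiesce and a quiesce-db JSON layout of "sets" keyed by set id with "members" keyed by root path (both taken from the upstream quiesce interface, not verified against this build):

# Sketch: dump per-member quiesce state for the timed-out set.
import json
import subprocess

out = subprocess.run(
    ["ceph", "fs", "quiesce", "cephfs", "--set-id", "cg_scale_f6fc",
     "--query", "--format", "json"],
    capture_output=True, text=True, check=True).stdout
for set_id, qset in json.loads(out).get("sets", {}).items():
    for root, member in qset.get("members", {}).items():
        print(set_id, root, member.get("state", {}).get("name"))

In this run the log above already points at the slow member: the root under /volumes/subvolgroup_cg/sv_non_def_1 is the one that hit the member quiesce timeout.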



Version-Release number of selected component (if applicable): 19.1.0-22


How reproducible:


Steps to Reproduce:
1. Create a quiesce set of 10 subvolumes (see the sketch after this list).
2. Run IO across all subvolumes from 10 different clients.
3. While IO is in progress, quiesce the set with timeout and expiration set to 600 secs.
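
A reproduction sketch for steps 1 and 3 (step 2, the client IO, is assumed to run separately from 10 clients; the fs name, group name, and set id below are illustrative):

# Repro sketch: build the 10-subvolume quiesce set and quiesce it under IO.
import subprocess

FS, GROUP, SET_ID = "cephfs", "subvolgroup_cg", "cg_repro"
members = [f"{GROUP}/sv_non_def_{i}" for i in range(1, 11)]

# Step 1: create the group and the subvolumes that form the quiesce set.
subprocess.run(["ceph", "fs", "subvolumegroup", "create", FS, GROUP], check=True)
for m in members:
    name = m.split("/")[1]
    subprocess.run(["ceph", "fs", "subvolume", "create", FS, name,
                    "--group_name", GROUP], check=True)

# Step 3: quiesce the whole set while the IO from step 2 is in progress.
subprocess.run(["ceph", "fs", "quiesce", FS, *members,
                "--set-id", SET_ID, "--await",
                "--timeout", "600", "--expiration", "600"], check=True)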

Actual results: Quiesce timed out on a member after 10 mins.


Expected results: Quiesce should succeed within the given timeout.


Additional info:

Automation logs: http://magna002.ceph.redhat.com/cephci-jenkins/results/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/tier-1_cephfs_cg_quiesce_systemic/cg_snap_system_test_0.log

MDS and OSD debug logs at magna002: /ceph/cephci-jenkins/results/openstack/RH/8.0/rhel-9/Weekly/19.1.0-22/cephfs/4/tier-1_cephfs_cg_quiesce_systemic/ceph_logs/

Please let me know if any additional information is required.

Comment 13 errata-xmlrpc 2024-11-25 09:06:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2024:10216

