Bug 2218009

Summary: [EC] Commvault space reclamation jobs hang. Underneath, the gluster volume mount hangs.
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: disperse
Version: rhgs-3.5
Hardware: Unspecified
OS: Unspecified
Status: NEW
Severity: urgent
Priority: unspecified
Assignee: Xavi Hernandez <jahernan>
Reporter: Andrew Robinson <anrobins>
CC: jahernan, olim, rafrojas, spamecha
Type: Bug

Description Andrew Robinson 2023-06-27 19:51:16 UTC
Before you record your issue, ensure you are using the latest version of Gluster.


Provide version-Release number of selected component (if applicable):

glusterfs-6.0-63.el7rhgs.x86_64
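
A minimal sketch of how the installed version can be confirmed on each node (standard package and CLI queries; nothing case-specific assumed):

    # List the gluster packages installed on every node
    rpm -qa | grep -i gluster
    # Cross-check the version reported by the CLI itself
    gluster --version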
 
Have you searched the Bugzilla archives for the same or similar reported issues?

This bug is a fork of bug 2173516 intended to focus on the space reclamation jobs.

Did you run an SoS report with the Insights tool?



Have you discovered any workarounds?
If not, read the troubleshooting documentation to help solve your issue:
https://mojo.redhat.com/groups/gss-gluster (Gluster features and their troubleshooting)
https://access.redhat.com/articles/1365073 (specific debug data that needs to be collected for GlusterFS troubleshooting)

Resolving pending heals is thought to be a workaround, but that has not helped in this case.
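
As a minimal sketch, with <volname> as a placeholder for the affected volume, pending heals would typically be checked and retriggered as follows (not a confirmed fix for this case):

    # Summarize entries still pending heal on each brick
    gluster volume heal <volname> info summary
    # Trigger an index heal for anything still pending
    gluster volume heal <volname>
    # Re-check afterwards; a persistent backlog points to a different root cause
    gluster volume heal <volname> info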

Please provide the following mandatory information:
1 - gluster v <volname> info
2 - gluster v <volname> heal info
3 - gluster v <volname> status
4 - Mount/access type (FUSE mount, SMB, nfs-ganesha, or OCS)?

In separate comments
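
A minimal sketch of collecting items 1-4 into files for attachment; <volname> is a placeholder:

    # Capture the three gluster outputs requested above
    gluster volume info <volname>      > /tmp/<volname>-info.txt
    gluster volume heal <volname> info > /tmp/<volname>-heal-info.txt
    gluster volume status <volname>    > /tmp/<volname>-status.txt
    # Item 4: record how the volume is mounted (FUSE, SMB, NFS-Ganesha, OCS)
    mount | grep -i gluster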

Describe the issue (please be as detailed as possible and provide log snippets):
[Provide a timestamp for when the issue is seen]

From the support case description:

We are seeing that during Space Reclamation efforts, the job hangs. This cluster consists of 6 nodes, and the erasure coding configuration is 8+4 (see the sketch after the node list below):

termxbakhyp01
termxbakhyp02
termxbakhyp03
termxbakhyp04
termxbakhyp05
termxbakhyp06
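
For context, a hedged sketch of what an 8+4 disperse layout across these six nodes could look like (two bricks per node; the volume name and brick paths are assumptions, not taken from the case):

    # Hypothetical layout: 12 bricks total (8 data + 4 redundancy), 2 bricks per node.
    # Gluster warns when multiple bricks of a disperse set share a host; this is purely illustrative.
    gluster volume create <volname> disperse-data 8 redundancy 4 \
        termxbakhyp01:/bricks/b1 termxbakhyp01:/bricks/b2 \
        termxbakhyp02:/bricks/b1 termxbakhyp02:/bricks/b2 \
        termxbakhyp03:/bricks/b1 termxbakhyp03:/bricks/b2 \
        termxbakhyp04:/bricks/b1 termxbakhyp04:/bricks/b2 \
        termxbakhyp05:/bricks/b1 termxbakhyp05:/bricks/b2 \
        termxbakhyp06:/bricks/b1 termxbakhyp06:/bricks/b2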

During the space reclamation job, the job will stop progressing. Example of the last hang:

23175 9490 06/23 11:52:03 3613108 [Controller] Updated progress bytes
23175 9490 06/23 11:52:03 3613108 [Controller] Sending progress to JM for reader :73
23175 9490 06/23 11:52:03 3613108 [Controller] Reporting status for reader [73] before sending stream status. Worker count [38].
23175 9490 06/23 11:52:03 3613108 [Reader_73] Discarding queued chunk list
23175 9490 06/23 11:52:03 3613108 [Reader_73] Destorying reader specifics
23175 9490 06/23 11:52:03 3613108 [Reader_73] Worker Thread is exiting. nAuxCopyErr [0].
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Received JobReplStreamStatusReq from Agent:termxbakhyp02.ternium.techint.net
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Stream ReaderId [73] from Agent [termxbakhyp02.ternium.techint.net] Status : [STREAM_COMPLETED_SUCCESS]
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Freed rcid [8418738] for CopyId:[261]
23175 91cc 06/23 11:53:23 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:53:23 3613108 [Controller] 	Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:53:23 3613108 [Controller] =====================================================================
23175 952a 06/23 11:53:44 3613108 MRU Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Primary Record Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Total verified Afiles [0], chunks [0]; Bad+InUse chunks [0], Bad+InUse Afiles [0]
23175 91cc 06/23 11:58:24 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:58:24 3613108 [Controller] 	Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:58:24 3613108 [Controller] =============================================


During these hangs, we usually see that df -h gets stuck on /ws/glus, and the hang clears only after rebooting the node or restarting glusterd. In the above example, a gluster volume statedump was captured, and strace was run from the start of the job until it hung; both were uploaded to the case.
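
A minimal sketch of how that data is typically captured while the mount is hung (<volname> and the client PID are placeholders):

    # Dump the state of all brick processes for the volume (written under /var/run/gluster/ on the servers)
    gluster volume statedump <volname>
    # Also dump the hung FUSE client: SIGUSR1 makes the glusterfs client process write a statedump
    kill -USR1 <glusterfs-client-pid>
    # Trace the hanging df against the gluster mount point
    strace -f -tt -o /tmp/df-glus.strace df -h /ws/glus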

Currently only a single pruner MediaAgent (MA) is configured, which is node termxbakhyp02.

Define the value or impact to you or the business:
Because the space reclamation jobs fail, the end customer is running out of space. 

Where are you experiencing this behavior? What environment?
Production backup target and only copy of the data

When does this behavior occur? Frequency? Repeatedly? At certain times?
Daily and multiple times a day.


Is this issue reproducible? If yes, share more details:


Steps to Reproduce:
1.
2.
3.
Actual results:
 
Expected results:
 
Any Additional info: