Before you record your issue, ensure you are using the latest version of Gluster.

Provide the version-release number of the selected component (if applicable):
glusterfs-6.0-63.el7rhgs.x86_64

Have you searched the Bugzilla archives for the same/similar issues?
This bug is a fork of bug 2173516 and is intended to focus on the space reclamation jobs.

Did you run an SoS report with the Insights tool?

Have you discovered any workarounds? If not, read the troubleshooting documentation to help solve your issue:
https://mojo.redhat.com/groups/gss-gluster (Gluster feature and its troubleshooting)
https://access.redhat.com/articles/1365073 (Specific debug data that needs to be collected for GlusterFS to help troubleshooting)
Resolving pending heals is thought to be a workaround, but it has not helped in this case.

Please provide the below mandatory information, in separate comments (see the collection sketch under "Any Additional info" below):
1 - gluster v <volname> info
2 - gluster v <volname> heal info
3 - gluster v <volname> status
4 - Fuse mount/SMB/nfs-ganesha/OCS ???

Describe the issue (please be as detailed as possible, provide log snippets, and include the timestamp when the issue is seen):

From the support case description:

We are seeing that during space reclamation efforts, the job hangs. This cluster consists of 6 nodes with 8+4 erasure coding:
termxbakhyp01
termxbakhyp02
termxbakhyp03
termxbakhyp04
termxbakhyp05
termxbakhyp06

During the space reclamation job, the job will stop progressing. Example of the last hang:

23175 9490 06/23 11:52:03 3613108 [Controller] Updated progress bytes
23175 9490 06/23 11:52:03 3613108 [Controller] Sending progress to JM for reader :73
23175 9490 06/23 11:52:03 3613108 [Controller] Reporting status for reader [73] before sending stream status. Worker count [38].
23175 9490 06/23 11:52:03 3613108 [Reader_73] Discarding queued chunk list
23175 9490 06/23 11:52:03 3613108 [Reader_73] Destorying reader specifics
23175 9490 06/23 11:52:03 3613108 [Reader_73] Worker Thread is exiting. nAuxCopyErr [0].
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Received JobReplStreamStatusReq from Agent:termxbakhyp02.ternium.techint.net
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Stream ReaderId [73] from Agent [termxbakhyp02.ternium.techint.net] Status : [STREAM_COMPLETED_SUCCESS]
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Freed rcid [8418738] for CopyId:[261]
23175 91cc 06/23 11:53:23 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:53:23 3613108 [Controller] Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:53:23 3613108 [Controller] =====================================================================
23175 952a 06/23 11:53:44 3613108 MRU Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Primary Record Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Total verified Afiles [0], chunks [0]; Bad+InUse chunks [0], Bad+InUse Afiles [0]
23175 91cc 06/23 11:58:24 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:58:24 3613108 [Controller] Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:58:24 3613108 [Controller] =============================================

During these hangs, we usually see that df -h hangs and gets stuck on /ws/glus, requiring a reboot of the node or a restart of glusterd. In the above example, a gluster volume statedump was captured, strace was run from the start of the job until the hang, and the results were uploaded to the case (see the capture sketch under "Any Additional info" below). Currently only a single pruner MA is configured, which is node termxbakhyp02.

Define the value or impact to you or the business:
Because the space reclamation jobs fail, the end customer is running out of space.

Where are you experiencing this behavior? What environment?
Production backup target and only copy of the data.

When does this behavior occur? Frequency? Repeatedly? At certain times?
Daily and multiple times a day.

Is this issue reproducible? If yes, share more details:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Any Additional info:
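For convenience, the mandatory volume information requested above could be gathered in one pass with something along these lines (a minimal sketch; the volume name "myvol" and the output path are placeholders, not values from this case):

    # Collect the requested volume details into a single file for attachment.
    VOLNAME=myvol                          # placeholder: substitute the affected volume's name
    OUT=/tmp/gluster-mandatory-info.txt
    {
      echo "== gluster volume info ==";      gluster volume info "$VOLNAME"
      echo "== gluster volume heal info =="; gluster volume heal "$VOLNAME" info
      echo "== gluster volume status ==";    gluster volume status "$VOLNAME"
      echo "== gluster client mounts ==";    mount | grep -i gluster
    } > "$OUT" 2>&1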
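If the hang needs to be captured again, the kind of data described above (df behavior on the fuse mount, a volume statedump, and an strace of the pruner process) might be collected while the job is stuck with commands along these lines (a sketch only; the mount path /ws/glus is from the case, but the volume name and process id are placeholder assumptions):

    # Check whether df on the fuse mount is hung, without blocking the shell indefinitely.
    timeout 30 df -h /ws/glus || echo "df on /ws/glus did not return within 30 seconds"

    # Trigger a statedump of the volume; dump files land under /var/run/gluster by default.
    gluster volume statedump myvol         # "myvol" is a placeholder volume name

    # Follow the space reclamation (pruner) process with strace until the hang is observed.
    PRUNER_PID=12345                       # placeholder: pid of the pruner/media agent process
    strace -f -tt -o /tmp/pruner-strace.log -p "$PRUNER_PID"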