Before you record your issue, ensure you are using the latest version of Gluster.

Provide the version-release number of the selected component (if applicable):
glusterfs-6.0-63.el7rhgs.x86_64

Have you searched the Bugzilla archives for the same/similar issues?
This bug is a fork of bug 2173516 and is intended to focus on the space reclamation jobs.

Did you run an SoS report with the Insights tool?

Have you discovered any workarounds? If not, read the troubleshooting documentation to help solve your issue:
https://mojo.redhat.com/groups/gss-gluster (Gluster feature and its troubleshooting)
https://access.redhat.com/articles/1365073 (Specific debug data that needs to be collected for GlusterFS to help troubleshooting)
Resolving pending heals is thought to be a workaround, but it has not helped in this case.

Please provide the below mandatory information, in separate comments (see the collection sketch under "Any Additional info" below):
1 - gluster v <volname> info
2 - gluster v <volname> heal info
3 - gluster v <volname> status
4 - Fuse mount/SMB/nfs-ganesha/OCS ???

Describe the issue (please be as detailed as possible, provide log snippets, and include the timestamp when the issue is seen):

From the support case description:

We are seeing that during space reclamation efforts, the job hangs. This cluster consists of 6 nodes with 8+4 erasure coding:
termxbakhyp01
termxbakhyp02
termxbakhyp03
termxbakhyp04
termxbakhyp05
termxbakhyp06

During the space reclamation job, the job will stop progressing. Example of the last hang:

23175 9490 06/23 11:52:03 3613108 [Controller] Updated progress bytes
23175 9490 06/23 11:52:03 3613108 [Controller] Sending progress to JM for reader :73
23175 9490 06/23 11:52:03 3613108 [Controller] Reporting status for reader [73] before sending stream status. Worker count [38].
23175 9490 06/23 11:52:03 3613108 [Reader_73] Discarding queued chunk list
23175 9490 06/23 11:52:03 3613108 [Reader_73] Destorying reader specifics
23175 9490 06/23 11:52:03 3613108 [Reader_73] Worker Thread is exiting. nAuxCopyErr [0].
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Received JobReplStreamStatusReq from Agent:termxbakhyp02.ternium.techint.net
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Stream ReaderId [73] from Agent [termxbakhyp02.ternium.techint.net] Status : [STREAM_COMPLETED_SUCCESS]
23175 91c7 06/23 11:52:03 3613108 [Coordinator] Freed rcid [8418738] for CopyId:[261]
23175 91cc 06/23 11:53:23 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:53:23 3613108 [Controller] Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:53:23 3613108 [Controller] =====================================================================
23175 952a 06/23 11:53:44 3613108 MRU Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Primary Record Cache Hits [0], Missed [0], Overlapped Hits [0]
23175 952a 06/23 11:53:44 3613108 Total verified Afiles [0], chunks [0]; Bad+InUse chunks [0], Bad+InUse Afiles [0]
23175 91cc 06/23 11:58:24 3613108 [Controller] ==================== Controller Current State =======================
23175 91cc 06/23 11:58:24 3613108 [Controller] Controller is Waiting for [37] Readers to be processed. Pending Readers List:[80,79,78,77,76,75,72,71,70,69,68,67,66,65,64,63,62,61,60,59,58,57,56,55,54,53,52,51,50,49,48,46,45,44,43,42,41]
23175 91cc 06/23 11:58:24 3613108 [Controller] =============================================

During these hangs, we usually see that df -h hangs and gets stuck on /ws/glus, requiring a reboot of the node or a restart of glusterd. In the above example, a gluster volume statedump was captured, strace was run from the start of the job until the hang, and the results were uploaded to the case (see the capture sketch under "Any Additional info" below). Currently only a single pruner MA is configured, which is node termxbakhyp02.

Define the value or impact to you or the business:
Because the space reclamation jobs fail, the end customer is running out of space.

Where are you experiencing this behavior? What environment?
Production backup target and only copy of the data.

When does this behavior occur? Frequency? Repeatedly? At certain times?
Daily and multiple times a day.

Is this issue reproducible? If yes, share more details:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Any Additional info:
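For convenience, the mandatory volume information requested above could be gathered in one pass with something along these lines (a minimal sketch; the volume name "myvol" and the output path are placeholders, not values from this case):

    # Collect the requested volume details into a single file for attachment.
    VOLNAME=myvol                          # placeholder: substitute the affected volume's name
    OUT=/tmp/gluster-mandatory-info.txt
    {
      echo "== gluster volume info ==";      gluster volume info "$VOLNAME"
      echo "== gluster volume heal info =="; gluster volume heal "$VOLNAME" info
      echo "== gluster volume status ==";    gluster volume status "$VOLNAME"
      echo "== gluster client mounts ==";    mount | grep -i gluster
    } > "$OUT" 2>&1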
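If the hang needs to be captured again, the kind of data described above (df behavior on the fuse mount, a volume statedump, and an strace of the pruner process) might be collected while the job is stuck with commands along these lines (a sketch only; the mount path /ws/glus is from the case, but the volume name and process id are placeholder assumptions):

    # Check whether df on the fuse mount is hung, without blocking the shell indefinitely.
    timeout 30 df -h /ws/glus || echo "df on /ws/glus did not return within 30 seconds"

    # Trigger a statedump of the volume; dump files land under /var/run/gluster by default.
    gluster volume statedump myvol         # "myvol" is a placeholder volume name

    # Follow the space reclamation (pruner) process with strace until the hang is observed.
    PRUNER_PID=12345                       # placeholder: pid of the pruner/media agent process
    strace -f -tt -o /tmp/pruner-strace.log -p "$PRUNER_PID"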