Bug 1311354

Summary: File operation hangs in 26 node cluster under heavy load
Product: [Community] GlusterFS
Component: fuse
Version: 3.5.5
Hardware: x86_64
OS: Linux
Severity: urgent
Priority: unspecified
Status: CLOSED EOL
Keywords: Triaged
Reporter: wymonsoon
Assignee: Raghavendra G <rgowdapp>
CC: bugs, smohan
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-17 15:57:04 UTC

Attachments:
client side log (flags: none)

Description wymonsoon 2016-02-24 02:28:33 UTC
Created attachment 1130004 [details]
client side log

Description of problem:
We are using GlusterFS 3.5.5.
The server side is deployed on a 26-node cluster; each node hosts one brick.
The client side is a 32-node cluster (including the 26 server nodes) that runs distributed video transcoding. The GlusterFS volume is the shared file store between the 32 servers, mounted with FUSE.
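
For reference, a minimal example of such a FUSE mount (the server hostname is one of the brick hosts listed below; the mount point is just an assumed example):

  # mount the hzsq_encode_02 volume over FUSE
  mount -t glusterfs hzsq-encode-33:/hzsq_encode_02 /mnt/gfs

  # equivalent /etc/fstab entry
  hzsq-encode-33:/hzsq_encode_02  /mnt/gfs  glusterfs  defaults,_netdev  0 0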

We found that when the workload is high, clients often hang on file operations on the GlusterFS mount. The client log shows that the client loses the ping response from the server (ping timeout), which leads to a series of "Transport endpoint is not connected" errors in the log.
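
For context (an observation, not a confirmed fix): the disconnects correspond to the network.ping-timeout option shown in the volume info below, which is set to 20 seconds here, while the GlusterFS default is 42 seconds. If the servers are too loaded to answer pings in time, the timeout can be raised, e.g.:

  # restore the default ping timeout of 42 seconds (example value)
  gluster volume set hzsq_encode_02 network.ping-timeout 42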

Version-Release number of selected component (if applicable):
3.5.5

How reproducible:
Dozens of times per hour under heavy load.


Steps to Reproduce:
1.
2.
3.

Actual results:
File operations on the GlusterFS mount hang under heavy load.

Expected results:
All file operations complete correctly, with no hangs.


Additional info:
OS: Debian 8.2
Kernel:  3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux

During the hang period, network connectivity (TCP/ping) between the client and the servers still works correctly.
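
In case it helps with diagnosis (the process lookup and paths below are assumptions about a typical setup), a statedump of the hung client and of the bricks shows which frames are stuck waiting on the server:

  # on a hung client: find the FUSE client process and request a statedump
  pgrep -af glusterfs
  kill -USR1 <pid of the glusterfs client process>   # dump is written under /var/run/gluster/ by default

  # on any server node: dump brick state for the volume
  gluster volume statedump hzsq_encode_02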

Our volume info:
Volume Name: hzsq_encode_02
Type: Distributed-Replicate
Volume ID: 653b554b-47aa-4f25-a102-7ac6858f41e1
Status: Started
Number of Bricks: 13 x 2 = 26
Transport-type: tcp
Bricks:
Brick1: hzsq-encode-33:/data/gfs-brk
Brick2: hzsq-encode-34:/data/gfs-brk
Brick3: hzsq-encode-41:/data/gfs-brk
Brick4: hzsq-encode-42:/data/gfs-brk
Brick5: hzsq-encode-43:/data/gfs-brk
Brick6: hzsq-encode-44:/data/gfs-brk
Brick7: hzsq-encode-45:/data/gfs-brk
Brick8: hzsq-encode-46:/data/gfs-brk
Brick9: hzsq-encode-47:/data/gfs-brk
Brick10: hzsq-encode-48:/data/gfs-brk
Brick11: hzsq-encode-49:/data/gfs-brk
Brick12: hzsq-encode-50:/data/gfs-brk
Brick13: hzsq-encode-51:/data/gfs-brk
Brick14: hzsq-encode-52:/data/gfs-brk
Brick15: hzsq-encode-53:/data/gfs-brk
Brick16: hzsq-encode-54:/data/gfs-brk
Brick17: hzsq-encode-55:/data/gfs-brk
Brick18: hzsq-encode-56:/data/gfs-brk
Brick19: hzsq-encode-57:/data/gfs-brk
Brick20: hzsq-encode-58:/data/gfs-brk
Brick21: hzsq-encode-59:/data/gfs-brk
Brick22: hzsq-encode-60:/data/gfs-brk
Brick23: hzsq-encode-61:/data/gfs-brk
Brick24: hzsq-encode-62:/data/gfs-brk
Brick25: hzsq-encode-63:/data/gfs-brk
Brick26: hzsq-encode-64:/data/gfs-brk
Options Reconfigured:
nfs.disable: On
performance.io-thread-count: 32
performance.cache-refresh-timeout: 1
performance.write-behind-window-size: 1MB
performance.cache-size: 128MB
performance.flush-behind: On
server.outstanding-rpc-limit: 0
performance.read-ahead: On
performance.io-cache: On
performance.quick-read: off
nfs.outstanding-rpc-limit: 0
network.ping-timeout: 20
server.statedump-path: /tmp
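
For completeness (the option used here is just an example taken from the list above), these options are applied and reverted per volume with the gluster CLI:

  gluster volume set hzsq_encode_02 performance.io-thread-count 32
  gluster volume reset hzsq_encode_02 performance.io-thread-count   # back to the default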

Comment 1 Niels de Vos 2016-06-17 15:57:04 UTC
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue in a more current release, please open a new bug against a version that still receives bugfixes.