Bug 1311354 - File operation hangs in 26 node cluster under heavy load
Product: GlusterFS
Classification: Community
Component: fuse
Hardware: x86_64 Linux
Version: unspecified
Severity: urgent
Assigned To: Raghavendra G
: Triaged
Depends On:
Reported: 2016-02-23 21:28 EST by wymonsoon
Modified: 2016-06-17 11:57 EDT (History)
2 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2016-06-17 11:57:04 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
client side log (1.57 KB, text/plain)
2016-02-23 21:28 EST, wymonsoon

Description wymonsoon 2016-02-23 21:28:33 EST
Created attachment 1130004 [details]
client side log

Description of problem:
We are using GlusterFS 3.5.5.
The server side is deployed on a 26-node cluster; each node hosts one brick.
The client side is a 32-node cluster (including the 26 server nodes) that runs distributed video transcoding. GlusterFS is the shared filesystem across the 32 servers, mounted via FUSE.

We found that when the workload is high, clients often hang on file operations on the GlusterFS mount. The client log indicates that the client loses ping responses from the server, which leads to a series of "Transport endpoint is not connected" errors in the log.
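When a FUSE client hangs like this, a statedump of the client and brick processes can show which call frames are stuck. The commands below are a diagnostic sketch, not part of the original report; they assume the volume name from this report (hzsq_encode_02), that the commands run on a cluster node with the gluster CLI, and that the hung client's glusterfs process can receive signals. Brick dumps land in the path configured by server.statedump-path (/tmp on this volume).

```shell
# Dump the state of all brick processes for the volume
# (files are written to the configured statedump path, /tmp here).
gluster volume statedump hzsq_encode_02

# On a hung client, SIGUSR1 makes the glusterfs FUSE process
# write its own statedump (pending frames, locks, inode table).
kill -USR1 "$(pidof glusterfs)"

# List clients currently connected to each brick, to check
# whether the hung mount is still considered connected.
gluster volume status hzsq_encode_02 clients
```

Comparing the pending frames in the client dump against the brick dumps taken at the same moment usually narrows down whether the request is stuck on the client, on the wire, or inside a brick.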

Version-Release number of selected component (if applicable):

How reproducible:
Dozens of times per hour.

Steps to Reproduce:

Actual results:
File operations on the GlusterFS mount hang under heavy load.

Expected results:
All file operations complete correctly; no hangs.

Additional info:
OS: debian 8.2
Kernel:  3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux

The TCP ping during the hang period is working correctly.

Our volume info:
Volume Name: hzsq_encode_02
Type: Distributed-Replicate
Volume ID: 653b554b-47aa-4f25-a102-7ac6858f41e1
Status: Started
Number of Bricks: 13 x 2 = 26
Transport-type: tcp
Brick1: hzsq-encode-33:/data/gfs-brk
Brick2: hzsq-encode-34:/data/gfs-brk
Brick3: hzsq-encode-41:/data/gfs-brk
Brick4: hzsq-encode-42:/data/gfs-brk
Brick5: hzsq-encode-43:/data/gfs-brk
Brick6: hzsq-encode-44:/data/gfs-brk
Brick7: hzsq-encode-45:/data/gfs-brk
Brick8: hzsq-encode-46:/data/gfs-brk
Brick9: hzsq-encode-47:/data/gfs-brk
Brick10: hzsq-encode-48:/data/gfs-brk
Brick11: hzsq-encode-49:/data/gfs-brk
Brick12: hzsq-encode-50:/data/gfs-brk
Brick13: hzsq-encode-51:/data/gfs-brk
Brick14: hzsq-encode-52:/data/gfs-brk
Brick15: hzsq-encode-53:/data/gfs-brk
Brick16: hzsq-encode-54:/data/gfs-brk
Brick17: hzsq-encode-55:/data/gfs-brk
Brick18: hzsq-encode-56:/data/gfs-brk
Brick19: hzsq-encode-57:/data/gfs-brk
Brick20: hzsq-encode-58:/data/gfs-brk
Brick21: hzsq-encode-59:/data/gfs-brk
Brick22: hzsq-encode-60:/data/gfs-brk
Brick23: hzsq-encode-61:/data/gfs-brk
Brick24: hzsq-encode-62:/data/gfs-brk
Brick25: hzsq-encode-63:/data/gfs-brk
Brick26: hzsq-encode-64:/data/gfs-brk
Options Reconfigured:
nfs.disable: On
performance.io-thread-count: 32
performance.cache-refresh-timeout: 1
performance.write-behind-window-size: 1MB
performance.cache-size: 128MB
performance.flush-behind: On
server.outstanding-rpc-limit: 0
performance.read-ahead: On
performance.io-cache: On
performance.quick-read: off
nfs.outstanding-rpc-limit: 0
network.ping-timeout: 20
server.statedump-path: /tmp
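One reconfigured option worth noting: network.ping-timeout is lowered here to 20 seconds, while the GlusterFS default is 42. Under heavy load, a busy brick can miss ping deadlines even though the TCP connection itself is healthy, which matches the observation above that TCP ping keeps working during the hang. As a hedged experiment, not a confirmed fix for this bug, the timeout can be restored to the default:

```shell
# Restore network.ping-timeout to the GlusterFS default (42 s);
# a short timeout can cause spurious client disconnects when
# bricks are too busy to answer pings promptly.
gluster volume set hzsq_encode_02 network.ping-timeout 42

# Confirm the reconfigured options on the volume.
gluster volume info hzsq_encode_02
```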
Comment 1 Niels de Vos 2016-06-17 11:57:04 EDT
This bug is being closed because version 3.5 is marked End-Of-Life; there will be no further updates to this version. If you still face this issue with a more current release, please open a new bug against a version that still receives bugfixes.
