1311354 – File operation hangs in 26 node cluster under heavy load

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	fuse
Sub Component:
Version:	3.5.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Raghavendra G
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-02-24 02:28 UTC by wymonsoon
Modified:	2016-06-17 15:57 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-06-17 15:57:04 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Attachments
client side log (1.57 KB, text/plain) 2016-02-24 02:28 UTC, wymonsoon

Description wymonsoon 2016-02-24 02:28:33 UTC

Created attachment 1130004
client side log

Description of problem:
We are using GlusterFS 3.5.5.
The server side is deployed on a 26-node cluster, with one brick per node.
The client side is a 32-node cluster (including the 26 server nodes) that runs distributed video transcoding.  The GlusterFS volume is the file share between the 32 machines, mounted with FUSE.

Under high workload, clients often hang on file operations on the GlusterFS mount.  The client log indicates that the client stops receiving ping responses from the server, which leads to a burst of "Transport endpoint is not connected" errors in the log.
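
A minimal sketch of how such a hang could be timed from a client, assuming a FUSE mount point is passed in as `path` (the function name `probe` and the 5-second default are illustrative and not part of GlusterFS). It runs `os.stat` in a daemon thread so that a blocked call cannot stall the caller, and reports whether the call returned, failed, or exceeded the timeout.

```python
import os
import threading
import time


def probe(path, timeout_s=5.0):
    """Stat `path` in a background thread and classify the outcome.

    Returns (status, elapsed_seconds), where status is "ok",
    "error: <message>", or "hang" if stat() did not return in time.
    """
    result = {}

    def worker():
        try:
            os.stat(path)
            result["status"] = "ok"
        except OSError as exc:
            # On a disconnected mount this is typically ENOTCONN.
            result["status"] = "error: " + (exc.strerror or str(exc))

    # Daemon thread: a stat() that never returns will not block exit.
    thread = threading.Thread(target=worker, daemon=True)
    start = time.monotonic()
    thread.start()
    thread.join(timeout_s)
    elapsed = time.monotonic() - start
    return result.get("status", "hang"), elapsed
```

Calling `probe()` on the mount point at a fixed interval and logging the timestamps of any "hang" or "error" results would make it possible to line the hangs up with the disconnect messages in the client log.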

Version-Release number of selected component (if applicable):
3.5.5

How reproducible:
Dozens of times per hour


Steps to Reproduce:
1.
2.
3.

Actual results:
File operations hang

Expected results:
All file operations run correctly, with no hang


Additional info:
OS: Debian 8.2
Kernel:  3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2015-08-04) x86_64 GNU/Linux

TCP ping between the nodes keeps working correctly during the hang period.

Our volume info:
Volume Name: hzsq_encode_02
Type: Distributed-Replicate
Volume ID: 653b554b-47aa-4f25-a102-7ac6858f41e1
Status: Started
Number of Bricks: 13 x 2 = 26
Transport-type: tcp
Bricks:
Brick1: hzsq-encode-33:/data/gfs-brk
Brick2: hzsq-encode-34:/data/gfs-brk
Brick3: hzsq-encode-41:/data/gfs-brk
Brick4: hzsq-encode-42:/data/gfs-brk
Brick5: hzsq-encode-43:/data/gfs-brk
Brick6: hzsq-encode-44:/data/gfs-brk
Brick7: hzsq-encode-45:/data/gfs-brk
Brick8: hzsq-encode-46:/data/gfs-brk
Brick9: hzsq-encode-47:/data/gfs-brk
Brick10: hzsq-encode-48:/data/gfs-brk
Brick11: hzsq-encode-49:/data/gfs-brk
Brick12: hzsq-encode-50:/data/gfs-brk
Brick13: hzsq-encode-51:/data/gfs-brk
Brick14: hzsq-encode-52:/data/gfs-brk
Brick15: hzsq-encode-53:/data/gfs-brk
Brick16: hzsq-encode-54:/data/gfs-brk
Brick17: hzsq-encode-55:/data/gfs-brk
Brick18: hzsq-encode-56:/data/gfs-brk
Brick19: hzsq-encode-57:/data/gfs-brk
Brick20: hzsq-encode-58:/data/gfs-brk
Brick21: hzsq-encode-59:/data/gfs-brk
Brick22: hzsq-encode-60:/data/gfs-brk
Brick23: hzsq-encode-61:/data/gfs-brk
Brick24: hzsq-encode-62:/data/gfs-brk
Brick25: hzsq-encode-63:/data/gfs-brk
Brick26: hzsq-encode-64:/data/gfs-brk
Options Reconfigured:
nfs.disable: On
performance.io-thread-count: 32
performance.cache-refresh-timeout: 1
performance.write-behind-window-size: 1MB
performance.cache-size: 128MB
performance.flush-behind: On
server.outstanding-rpc-limit: 0
performance.read-ahead: On
performance.io-cache: On
performance.quick-read: off
nfs.outstanding-rpc-limit: 0
network.ping-timeout: 20
server.statedump-path: /tmp
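
For reference, a small sketch that turns a volume listing like the one above into replica pairs and an options table. It assumes the text follows the `gluster volume info` layout shown here and that, as in a Distributed-Replicate volume, consecutive bricks in the listed order form one replica set. The function names `parse_volume_info` and `replica_sets` are illustrative and not part of any GlusterFS tooling.

```python
import re


def parse_volume_info(text):
    """Parse `gluster volume info`-style text into layout, bricks, options."""
    info = {"bricks": [], "options": {}}
    in_options = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only: brick values contain "host:/path".
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Options Reconfigured":
            in_options = True
        elif in_options:
            info["options"][key] = value
        elif re.fullmatch(r"Brick\d+", key):
            info["bricks"].append(value)
        elif key == "Number of Bricks":
            m = re.fullmatch(r"(\d+) x (\d+) = (\d+)", value)
            if m:
                # (distribute subvolumes, replica count, total bricks)
                info["layout"] = tuple(int(g) for g in m.groups())
    return info


def replica_sets(info):
    """Group bricks into replica sets, in the order they are listed."""
    subvols, replica, total = info["layout"]
    bricks = info["bricks"]
    if subvols * replica != total or len(bricks) != total:
        raise ValueError("brick count does not match the reported layout")
    return [bricks[i:i + replica] for i in range(0, total, replica)]
```

Under those assumptions, the listing above would pair hzsq-encode-33 with hzsq-encode-34, hzsq-encode-41 with hzsq-encode-42, and so on, which shows which two servers a file operation depends on when one of them stops answering pings.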

Comment 1 Niels de Vos 2016-06-17 15:57:04 UTC

This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. If you still face this issue in a more current release, please open a new bug against a version that still receives bug fixes.
