Bug 1043009 - gluster fails under heavy array job load
Summary: gluster fails under heavy array job load
Keywords:
Status: CLOSED EOL
Alias: None
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-12-13 17:49 UTC by Harry Mangalam
Modified: 2015-10-07 12:12 UTC (History)
2 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-10-07 12:12:21 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Harry Mangalam 2013-12-13 17:49:26 UTC
Previously posted to the gluster mailing list; full logs for one server are at <http://goo.gl/cSblQL> and for several clients at <http://goo.gl/6IYJZQ>.

Description of problem:
Short version: our gluster filesystem (~340 TB) provides scratch space for a ~5000-core academic compute cluster.
Much of our load is streaming I/O from genomics work, and that is the load under which we saw this latest failure.
Under heavy batch load, especially array jobs where several 64-core nodes may be doing I/O against the 4 servers / 8 bricks, we often get job failures with the following profile:
 
Client POV:
Here is a sampling of the client logs (/var/log/glusterfs/gl.log) from all compute nodes that showed interaction with the user's files:
<http://pastie.org/8548781>
 
Here are some client Info logs that seem fairly serious:
<http://pastie.org/8548785>
 
The errors that referenced this user were gathered from all the nodes that were running his code (into files named compute*) and aggregated with:
 
cut -f2,3 -d']' compute* |cut -f1 -dP | sort | uniq -c | sort -gr
 
and placed here to show the profile of errors that his run generated.
<http://pastie.org/8548796>
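For clarity, here is the same pipeline with comments; this is only an annotated restatement of the command above, where the compute* files are the per-node excerpts of /var/log/glusterfs/gl.log:
---
# splitting each log line on ']' gives the severity/function (field 2) and the message (field 3)
# cutting at the first capital 'P' drops the per-file "Path: ..." suffix so identical messages collapse
# sort | uniq -c | sort -gr counts each distinct message and lists the most frequent first
cut -f2,3 -d']' compute* | cut -f1 -dP | sort | uniq -c | sort -gr
---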
 
Of the messages in that profile, 71 were:
W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-7: remote operation failed: Transport endpoint is not connected.
and so on.
 
We've seen this before and previously discounted it because it seemed to be related to spurious NFS-related bugs, but now I'm wondering whether it's a real problem.
The same goes for the 'remote operation failed: Stale file handle. ' warnings.
 
There were no Errors logged per se, though some of the Warnings looked fairly nasty, like the 'dht_layout_dir_mismatch' ones.
 
From the server side, however, during the same period, there were:
0 Warnings about this user's files
0 Errors
458 Info lines
of which only 1 line was not a 'cleanup' line like this:
---
10.2.7.11:[2013-12-12 21:22:01.064289] I [server-helpers.c:460:do_fd_cleanup] 0-gl-server: fd cleanup on /path/to/file
---
it was:
---
10.2.7.14:[2013-12-12 21:00:35.209015] I [server-rpc-fops.c:898:_gf_server_log_setxattr_failure] 0-gl-server: 113697332: SETXATTR /bio/tdlong/RNAseqIII/ckpt.1084030 (c9488341-c063-4175-8492-75e2e282f690) ==> trusted.glusterfs.dht
---
 
We're losing about 10% of these kinds of array jobs because of this, which is simply not sustainable.
 
 
 
Gluster details
 
Servers and clients run gluster 3.4.0-8.el6 on CentOS 6.4, over QDR InfiniBand (IPoIB) through 2 Mellanox switches and 1 Voltaire switch, with Mellanox cards.
 
$ gluster volume info
Volume Name: gl
Type: Distribute
Volume ID: 21f480f7-fc5a-4fd8-a084-3964634a9332
Status: Started
Number of Bricks: 8
Transport-type: tcp,rdma
Bricks:
Brick1: bs2:/raid1
Brick2: bs2:/raid2
Brick3: bs3:/raid1
Brick4: bs3:/raid2
Brick5: bs4:/raid1
Brick6: bs4:/raid2
Brick7: bs1:/raid1
Brick8: bs1:/raid2
Options Reconfigured:
performance.write-behind-window-size: 1024MB
performance.flush-behind: on
performance.cache-size: 268435456
nfs.disable: on
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
auth.allow: 10.2.*.*,10.1.*.*
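(For anyone trying to reproduce a similar setup, the 'Options Reconfigured' values above can be re-applied with 'gluster volume set'; this is only a sketch against our volume name 'gl', not a tuning recommendation:)
---
# shell sketch: re-apply the reconfigured options to a volume named 'gl'
gluster volume set gl performance.write-behind-window-size 1024MB
gluster volume set gl performance.flush-behind on
gluster volume set gl performance.cache-size 268435456
gluster volume set gl nfs.disable on
gluster volume set gl performance.io-cache on
gluster volume set gl performance.quick-read on
gluster volume set gl performance.io-thread-count 64
gluster volume set gl auth.allow '10.2.*.*,10.1.*.*'
---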
 
 
'gluster volume status gl detail':
<http://pastie.org/8548826>
 

Version-Release number of selected component (if applicable):

3.4.0-8.el6

How reproducible:
Almost every I/O-heavy SGE array job is losing array elements because gluster cannot keep up with the I/O; up to 10% of the jobs are failing. We recently added Berkeley checkpointing, which imposes extra I/O on the system, and the job loss seems to have accelerated with this option. For the job that most recently triggered this problem, there were ~1000 array job elements and about 100 failed with errors.

Steps to Reproduce:
1. Start an array job with heavy I/O. Sorry to be so brief, but the scripts, setup, and data are complex and very large, and it would be very difficult to duplicate them elsewhere; a rough sketch of the job shape follows these steps.
2. Wait for jobs to fail.
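As a rough illustration only (the real pipeline cannot be shared), the failing workload is shaped something like the SGE array job below; the mount point /gl, the file paths, and the use of dd as a stand-in for the genomics tools are all hypothetical:
---
#!/bin/bash
# ~1000 array elements, roughly matching the failing run
#$ -t 1-1000
#$ -cwd
#$ -N gl_io_stress

# Hypothetical scratch paths on the gluster mount; dd stands in for the
# real pipeline's streaming read of a large input and write of results.
SRC=/gl/scratch/stress/input.${SGE_TASK_ID}
DST=/gl/scratch/stress/output.${SGE_TASK_ID}

dd if="$SRC" of="$DST" bs=1M
---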

Actual results:


Expected results:


Additional info:

Comment 1 Niels de Vos 2015-05-17 22:01:14 UTC
GlusterFS 3.7.0 has been released (http://www.gluster.org/pipermail/gluster-users/2015-May/021901.html), and the Gluster project maintains N-2 supported releases; the last two releases before 3.7 are still maintained, which at the moment are 3.6 and 3.5.

This bug has been filed against the 3.4 release, and will not get fixed in a 3.4 version any more. Please verify whether newer versions are affected by the reported problem. If that is the case, update the bug with a note, and update the version if you can. If updating the version is not possible, leave a comment in this bug report with the version you tested, and set the "Need additional information the selected bugs from" field below the comment box to "bugs".

If there is no response by the end of the month, this bug will get automatically closed.

Comment 2 Kaleb KEITHLEY 2015-10-07 12:12:21 UTC
GlusterFS 3.4.x has reached end-of-life.

If this bug still exists in a later release, please reopen this bug and change the version, or open a new bug.

