Bug 1175551

Summary: Intermittent open() failure after creating a directory
Product: [Community] GlusterFS
Reporter: Louis R. Marascio <marascio>
Component: distribute
Assignee: bugs <bugs>
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Version: 3.5.2
CC: bugs, ndevos, rgowdapp, srangana
Keywords: Triaged
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2016-06-17 15:58:07 UTC
Type: Bug
Bug Depends On: 1094724, 1121920

Description Louis R. Marascio 2014-12-18 02:09:21 UTC
I have a 5 node, 10 brick, replica 2 GlusterFS setup. GlusterFS is providing shared storage to an HTCondor compute cluster.

It is observed that, periodically, jobs experience failures when calling open() to open a file for writing. Failures are isolated to a specific job consisting of approximately 100 distinct tasks running in parallel; other jobs running at that time are not affected.

When a job starts it spawns numerous tasks in parallel. Each of these tasks, at startup, effectively does:

if [ ! -d /shared/scratchdir ]; then
    mkdir -p /shared/scratchdir
fi

run-python-program

Here run-python-program changes directory to /shared/scratchdir and then attempts to open a file for writing. It is this open() that fails with a return code of -1.
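
What the Python program effectively does at that point can be approximated in shell as follows (the file name output.dat and the shell redirect standing in for the program's open() call are illustrative, not taken from the actual job):

cd /shared/scratchdir || exit 1
# the open-for-write performed by the Python program; in the failing
# case this is the call that returns -1
echo "task output" > output.dat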

Most jobs succeed; some fail with this specific error. If one task within a job encounters the error, it seems that all of the job's tasks do.

Gluster log files show the following messages (in this log snippet, /shared/scratchdir would correspond to /scratch/d0500f5c):

srv-gv0.log:[2014-12-17 23:19:08.186593] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-9: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.186669] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-8: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.186739] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-4: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.204792] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-2: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.212364] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-5: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.220757] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-3: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.365600] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-7: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.380489] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-6: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.405157] W [fuse-bridge.c:407:fuse_entry_cbk] 0-glusterfs-fuse: Received NULL gfid for /scratch/d0500f5c. Forcing EIO
srv-gv0.log:[2014-12-17 23:19:08.405198] W [fuse-bridge.c:472:fuse_entry_cbk] 0-glusterfs-fuse: 994254678: MKDIR() /scratch/d0500f5c => -1 (Input/output error)
srv-gv0.log:[2014-12-17 23:19:08.420102] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-4:  metadata self heal  is successfully completed,   metadata self heal from source gv0-client-9 to gv0-client-8,  metadata - Pending matrix:  [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.421130] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-2:  metadata self heal  is successfully completed,   metadata self heal from source gv0-client-4 to gv0-client-5,  metadata - Pending matrix:  [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.425125] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-1:  metadata self heal  is successfully completed,   metadata self heal from source gv0-client-2 to gv0-client-3,  metadata - Pending matrix:  [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.437584] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-3:  metadata self heal  is successfully completed,   metadata self heal from source gv0-client-6 to gv0-client-7,  metadata - Pending matrix:  [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c

It is also observed that, sometimes, instead of open() failing with -1, the mkdir -p itself fails with "mkdir: cannot create directory `/srv/gv0/scratch/f3a2b930': Input/output error".

So, you have multiple simultaneous (or near-simultaneous) mkdir -p operations happening on a specific directory on a GlusterFS 10-brick replica 2 volume. Most of the time mkdir succeeds but open() fails with -1; sometimes mkdir itself fails. In every instance that I have checked after a failure, the directory exists and is empty.

In conversations on IRC with Joe Julian it was suggested that I confirm that each of the compute nodes had a connection to each brick. I checked with netstat and we confirmed that this is the case. It would appear that there is a race of some sort whereby each of the mkdir -p operations succeeds (most of the time), but a subsequent open() fails until a self-heal brings the metadata into a consistent state.
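
To illustrate the suspected race, here is a rough reproducer sketch (the directory name and task count are illustrative, and in the real failure the tasks run on different compute nodes, so a single-host loop may not hit exactly the same code path):

DIR=/shared/scratch-test-$$   # illustrative path, not the real scratchdir
for i in $(seq 1 100); do
    (
        if [ ! -d "$DIR" ]; then
            mkdir -p "$DIR"
        fi
        cd "$DIR" && : > "task-$i.out"
    ) &
done
wait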

Gluster information:

$ glusterfs -V
glusterfs 3.5.2 built on Aug  6 2014 19:36:58

System information:

$ uname -a
Linux dallas 3.13.0-40-generic #69~precise1-Ubuntu SMP Fri Nov 14 10:29:31 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.5 LTS
Release:	12.04
Codename:	precise

Comment 1 Niels de Vos 2014-12-20 19:15:15 UTC
> When a job starts it spawns numerous tasks in parallel. Each of these tasks,
> at startup, effectively does:
> 
> if [ ! -d /shared/scratchdir ]; then
>     mkdir -p /shared/scratchdir
> fi

I suspect that this concurrent creation of the directory is the issue. Other bugs where this problem is also observed have been linked to this one.

It sounds possible that one process was able to create the directory on some of the bricks, while another process created the same directory on the other bricks. This leads to a directory that has different GFIDs on different bricks.
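
If that is what happened here, it should be visible by comparing the trusted.gfid xattr of the directory on each brick (run on the brick servers; <brick-path> is a placeholder for your actual brick directory):

# on each brick server, substituting the brick's backend directory for <brick-path>
getfattr -n trusted.gfid -e hex <brick-path>/scratch/d0500f5c

A mismatch in the reported GFIDs between bricks would confirm that two clients created the directory independently.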

A workaround would be to create the /shared/scratchdir directory before the job starts the additional tasks.
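
For example (a sketch only; how this fits into the job submission depends on your HTCondor setup):

# run once, from a single process, before the parallel tasks are spawned
mkdir -p /shared/scratchdir
# ... then start the ~100 tasks, which no longer need their own mkdir check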

Adding the maintainers of DHT to CC; they should know more about this issue and the current state of the solution (http://review.gluster.org/4846).

Comment 2 Louis R. Marascio 2014-12-20 19:20:54 UTC
(In reply to Niels de Vos from comment #1)
> A workaround would be to create the /shared/scratchdir directory before the
> job starts the additional tasks.

Yes, you're right about this. I changed the job flow to create the directory first and so far have not seen the failure again.

Comment 3 Niels de Vos 2016-06-17 15:58:07 UTC
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. If you are still facing this issue with a more current release, please open a new bug against a version that still receives bugfixes.