I have a 5-node, 10-brick, replica 2 GlusterFS setup providing shared storage to an HTCondor compute cluster. Periodically, jobs experience failures when calling open() to open a file for writing. The failures are isolated to a specific job composed of approximately 100 distinct tasks running in parallel; other jobs running at the same time are not affected.

When a job starts it spawns numerous tasks in parallel. Each of these tasks, at startup, effectively does:

  if [ ! -d /shared/scratchdir ]; then
      mkdir -p /shared/scratchdir
  fi
  run-python-program

where run-python-program changes directory to /shared/scratchdir and then attempts to open a file for writing. It is this open() that fails with a return code of -1. Most jobs succeed; some fail with this specific error. If one task within a job encounters the error, it seems all of them do.

Gluster log files show the following messages (in this log snippet, /shared/scratchdir corresponds to /scratch/d0500f5c):

srv-gv0.log:[2014-12-17 23:19:08.186593] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-9: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.186669] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-8: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.186739] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-4: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.204792] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-2: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.212364] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-5: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.220757] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-3: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.365600] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-7: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.380489] W [client-rpc-fops.c:322:client3_3_mkdir_cbk] 0-gv0-client-6: remote operation failed: File exists. Path: /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.405157] W [fuse-bridge.c:407:fuse_entry_cbk] 0-glusterfs-fuse: Received NULL gfid for /scratch/d0500f5c. Forcing EIO
srv-gv0.log:[2014-12-17 23:19:08.405198] W [fuse-bridge.c:472:fuse_entry_cbk] 0-glusterfs-fuse: 994254678: MKDIR() /scratch/d0500f5c => -1 (Input/output error)
srv-gv0.log:[2014-12-17 23:19:08.420102] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-4: metadata self heal is successfully completed, metadata self heal from source gv0-client-9 to gv0-client-8, metadata - Pending matrix: [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.421130] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-2: metadata self heal is successfully completed, metadata self heal from source gv0-client-4 to gv0-client-5, metadata - Pending matrix: [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.425125] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-1: metadata self heal is successfully completed, metadata self heal from source gv0-client-2 to gv0-client-3, metadata - Pending matrix: [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c
srv-gv0.log:[2014-12-17 23:19:08.437584] I [afr-self-heal-common.c:2868:afr_log_self_heal_completion_status] 0-gv0-replicate-3: metadata self heal is successfully completed, metadata self heal from source gv0-client-6 to gv0-client-7, metadata - Pending matrix: [ [ 1 1 ] [ 1 1 ] ], on /scratch/d0500f5c

It is also observed that, sometimes, instead of open() failing with -1, the mkdir -p itself fails with:

  mkdir: cannot create directory `/srv/gv0/scratch/f3a2b930': Input/output error

So there are multiple simultaneous (or near-simultaneous) mkdir -p operations happening on a specific directory on a GlusterFS 10-brick replica 2 volume. Most of the time mkdir succeeds but open() fails with -1; sometimes mkdir fails. In every instance that I have checked after a failure, the directory exists and is empty.

In conversations on IRC with Joe Julian it was suggested to confirm that each of the compute nodes had a connection to each brick. I did that check with netstat and we confirmed it was the case.

It would appear that there is a race of some sort whereby each of the mkdir -p operations succeeds (most of the time), but an open fails until a self-heal brings the metadata into a consistent state.

Gluster information:

  $ glusterfs -V
  glusterfs 3.5.2 built on Aug 6 2014 19:36:58

System information:

  $ uname -a
  Linux dallas 3.13.0-40-generic #69~precise1-Ubuntu SMP Fri Nov 14 10:29:31 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

  $ lsb_release -a
  No LSB modules are available.
  Distributor ID: Ubuntu
  Description:    Ubuntu 12.04.5 LTS
  Release:        12.04
  Codename:       precise
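For clarity, a minimal shell sketch of the racing pattern described above (not taken from the actual job; the scratch path and output file names are made up, and in the real cluster the tasks run on several HTCondor execute nodes rather than as background processes on one host):

  # Hypothetical sketch only: each of the ~100 parallel tasks performs the
  # same "test, mkdir -p, then open for writing" sequence against the shared
  # GlusterFS mount. SCRATCH and the task-*.out names are assumptions.
  SCRATCH=/shared/scratchdir
  for i in $(seq 1 100); do
      (
          if [ ! -d "$SCRATCH" ]; then
              mkdir -p "$SCRATCH"              # several tasks may race through this
          fi
          cd "$SCRATCH" && : > "task-$i.out"   # stands in for the Python open() for writing
      ) &
  done
  wait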
> When a job starts it spawns numerous tasks in parallel. Each of these
> tasks, at startup, effectively does:
>
>   if [ ! -d /shared/scratchdir ]; then
>       mkdir -p /shared/scratchdir
>   fi

I suspect that this concurrent creation of the directory is the issue. Other bugs in which this problem is observed have been linked to this one. It sounds possible that one process was able to create the directory on some of the bricks, while another process created the same directory on other bricks. This leads to a directory that has different GFIDs on different bricks.

A workaround would be to create the /shared/scratchdir directory before the job starts the additional tasks (a rough sketch follows below).

Adding the maintainers of DHT on CC, who should know more about this issue and the current state of the solution (http://review.gluster.org/4846).
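As an illustration of that workaround (not part of the original comment; the script name and path are assumed), the job's setup step could create the directory exactly once before any parallel tasks are dispatched:

  #!/bin/sh
  # Hedged sketch of the suggested workaround: create the scratch directory
  # once, serially, in the job's setup step, so no two tasks ever race on
  # mkdir for the same path. The path here is an assumption.
  mkdir -p /shared/scratchdir || exit 1
  # ...then submit or spawn the ~100 parallel tasks; each task can drop its
  # own "[ ! -d ... ] && mkdir -p" check and simply cd into the directory.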
(In reply to Niels de Vos from comment #1)
> A workaround would be to create the /shared/scratchdir directory before the
> job starts the additional tasks.

Yes, you're right about this. I changed the job flow to create the directory first and so far have not seen the failure again.
This bug is being closed because the 3.5 release is marked End-Of-Life. There will be no further updates to this version. Please open a new bug against a version that still receives bugfixes if you are still facing this issue in a more current release.