Bug 762772 (GLUSTER-1040) - FORTRAN I/O exhibits odd behaviors
Summary: FORTRAN I/O exhibits odd behaviors
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1040
Product: GlusterFS
Classification: Community
Component: unclassified
Version: 3.0.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Vijay Bellur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-06-30 19:44 UTC by Brian Smith
Modified: 2015-12-01 16:45 UTC (History)
5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Glusterfs client log file w/ trace (133.97 KB, text/plain)
2010-06-30 17:35 UTC, Brian Smith
Glusterfsd log file for pvfs0-1 w/ trace (64 bytes, text/plain)
2010-06-30 17:38 UTC, Brian Smith
Glusterfsd log file for pvfs1-1 w/ trace (64 bytes, text/plain)
2010-06-30 17:38 UTC, Brian Smith

Description Brian Smith 2010-06-30 16:48:27 UTC
For some more background info:

The bricks are both SunFire X4500s running CentOS 5.4 w/ 46x500GB drives configured in 23 RAID1 devices.  LVM aggregates these into a VG and a single large LV stripes across these volumes providing 10TB of XFS space on each brick.  XFS is mounted with inode64 and has the following configuration:

# xfs_db -r -c sb -c p /dev/mapper/vg1-pvfs
magicnum = 0x58465342
blocksize = 4096
dblocks = 2684368896
rblocks = 0
rextents = 0
uuid = 86c66713-e127-4033-92be-12d3fea43ef0
logstart = 2147483776
rootino = 2048
rbmino = 2049
rsumino = 2050
rextsize = 2816
agblocks = 83886592
agcount = 32
rbmblocks = 0
logblocks = 32768
versionnum = 0x3194
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 27
rextslog = 0
inprogress = 0
imax_pct = 25
icount = 2793280
ifree = 1136
fdblocks = 1281924788
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 128
width = 2816
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 0
features2 = 0

Comment 1 Harshavardhana 2010-06-30 17:09:09 UTC
> dirblklog = 0
> logsectlog = 0
> logsectsize = 0
> logsunit = 0
> features2 = 0

Thanks for the information, Brian. Could you please run at the "TRACE" log level and send us the client and server log files? A full log attachment would be better.

Comment 2 Brian Smith 2010-06-30 17:35:04 UTC
Created attachment 242 [details]
Glusterfs client log file w/ trace

Comment 3 Brian Smith 2010-06-30 17:38:35 UTC
Created attachment 243 [details]
Glusterfsd log file for pvfs0-1 w/ trace

Comment 4 Brian Smith 2010-06-30 17:38:57 UTC
Created attachment 244 [details]
Glusterfsd log file for pvfs1-1 w/ trace

Comment 5 Brian Smith 2010-06-30 17:39:36 UTC
Attachments are added.  There is a fair amount of traffic in them, so to give you a hint, you may want to focus on entries that include 'brs' in the path.

Thanks,
-Brian

Comment 6 Brian Smith 2010-06-30 19:44:09 UTC
Several of our FORTRAN-based applications experience issues while running on GlusterFS 3.0.3.  On rare occasions, the applications will work, but most of the time, applications fail with some sort of file open/stat/overwrite error.  An example is shown below while running VASP:

forrtl: File exists
forrtl: severe (10): cannot overwrite existing file, unit 18, file /work/b/brs/Si/CHGCAR

Corresponding debug log entries on my storage bricks show:

[2010-06-30 15:30:54] D [server-protocol.c:2104:server_create_cbk] server-tcp: create(/b/brs/Si/CHGCAR) inode (ptr=0x2aaab00e05b0, ino=2159011921, gen=5488651098262601749) found conflict (ptr=0x2aaab40cca00, ino=2159011921, gen=5488651098262601749)
[2010-06-30 15:30:54] D [server-resolve.c:386:resolve_entry_simple] server-tcp: inode (pointer: 0x2aaab40cca00 ino:2159011921) found for path (/b/brs/Si/CHGCAR) while type is RESOLVE_NOT
[2010-06-30 15:30:54] D [server-protocol.c:2132:server_create_cbk] server-tcp: 72: CREATE (null) (0) ==> -1 (File exists)

Debug logs on clients show:

[2010-06-30 15:30:54] W [fuse-bridge.c:1719:fuse_create_cbk] glusterfs-fuse: 215: /b/brs/Si/CHGCAR => -1 (File exists)
[2010-06-30 15:30:54] D [client-protocol.c:4929:client_lookup_cbk] pvfs1-1: LOOKUP 4318019839/WAVECAR (/b/brs/Si/WAVECAR): inode number changed from {5488626372135878729,2159009924} to {5488651098262601750,2159011922}

Below are my vol files for client and server side.  (Don't mind the hostnames... they used to run PVFS2 :)  On clients, I've disabled ALL performance translators.

================ glusterfs.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs

# TRANSPORT-TYPE tcp
volume pvfs0-1
    type protocol/client
    option transport-type tcp
    option remote-host pvfs0
    option remote-port 6996
    option transport.socket.nodelay on
    option remote-subvolume brick1
end-volume

volume pvfs1-1
    type protocol/client
    option transport-type tcp
    option remote-host pvfs1
    option remote-port 6996
    option transport.socket.nodelay on
    option remote-subvolume brick1
end-volume

volume distribute
    type cluster/distribute
    subvolumes pvfs0-1 pvfs1-1
end-volume

================ glusterfsd.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs

volume posix1
  type storage/posix
  option directory /pvfs/glusterfs
end-volume

volume locks1
    type features/locks
    subvolumes posix1
    option mandatory-locks on
end-volume

volume brick1
    type performance/io-threads
    option thread-count 8
    subvolumes locks1
end-volume

volume server-tcp
    type protocol/server
    option transport-type tcp
    option auth.addr.brick1.allow *
    option transport.socket.listen-port 6996
    option transport.socket.nodelay on
    subvolumes brick1
end-volume

Thanks in advance for your attention.

-Brian Smith

Comment 7 Harshavardhana 2010-06-30 23:10:39 UTC
(In reply to comment #5)
> Attachments are added.  There is a fair amount of traffic in them, so to give
> you a hint, you may want to focus on entries that include 'brs' in the path.

More than what we need, thank you :-)

Comment 8 Jeff Darcy 2010-07-01 17:34:28 UTC
This looks very much like seven processes trying to open /b/brs/Si/CHGCAR for create simultaneously.  There are seven failed LOOKUP calls all back to back, followed by seven CREATE calls of which six (N-1) generate "found conflict" messages and fail.  From looking at the code, it seems like the "found conflict" message occurs when an inode is hashed but not yet linked to its parent, which should be a transient state.  The other message ("...while type is RESOLVE_NOT") is what should happen when the file is fully created.  It's possible that something is causing an inode to get into a persistent "hashed but not linked" state, but I think first we'd have to rule out the possibility that there are in fact seven open/create calls racing with one another.

If the lookup/create were treated as a single atomic operation there would be no race.  Even as two operations, the behavior is very timing-dependent which might explain why the problem does not surface on other filesystems such as PVFS2.  In any case, if there are indeed seven lookup/open/create sequences, then the obvious workaround is to serialize them somehow.

Comment 9 Brian Smith 2010-07-02 12:37:19 UTC
Jeff,

Looking at the application source code, it looks like at the beginning of execution, CHGCAR and others are opened with STATUS='UNKNOWN'.  These calls are NOT contained in any conditional that checks process rank.  Given this is an MPI job, it looks as though all spawned MPI processes will attempt to go through this same procedure.  I need to go through more of the code and see if there are any reads/writes also taking place against the opened handles from all processes.  I suspect that most of the file I/O that takes place on all processes is reads of initial inputs.  If that's the case, I could work around this by calling OPEN from only the master process, calling a barrier, then calling OPEN on the child processes with STATUS='OLD'.  Depending on the structure of the code, this can be a royal pain if OPENs are occurring like this in other places.

-Brian

Comment 10 Jeff Darcy 2010-07-02 13:25:19 UTC
Yep, my last job was in HPC (SiCortex) and I did see an *awful* lot of code that did exactly this kind of thing.  Unfortunately, there's not a whole lot the filesystem can do, because the "lookup, then create or open depending on the result" sequence is done within the kernel (do_filp_open in fs/namei.c).  One possible workaround, if there are too many applications to fix, would be to add a layer somewhere that automatically converts CREATEs which fail with EEXIST into OPENs.  This could be done with a translator, with an LD_PRELOAD, or with some even more subtle kinds of dynamic-loader magic.  In all cases it's an approach that requires extreme caution because there are likely to be other cases where an open with O_EXCL (status='new' in FORTRAN) is used properly for synchronization/exclusion and the EEXIST must not be suppressed in this way.

Does it seem fair to you, based on what we've discovered, to say that GlusterFS is behaving correctly in this case (i.e. that the bug can be closed)?  I'd be glad to help you implement a workaround, but AFAICT there's nothing to be done to the Gluster code as it is now.

Comment 11 Brian Smith 2010-07-02 16:04:34 UTC
I think you're right, it's not a bug, but it does pose a compatibility issue.  If all other filesystems are handling this correctly, does it not make sense that Gluster should as well?  You agreed from your experience that there are numerous HPC-related codes that do this (which is bad), so maybe something like an RFE is more appropriate?

Comment 12 Brian Smith 2010-07-02 16:05:52 UTC
Perhaps "correctly" is too strong a word... "gracefully" would be better.  It seems odd that this is allowed to work anywhere, but it does.

Comment 13 Brian Smith 2010-07-02 16:14:48 UTC
It appears that if I touch the expected output files to create them beforehand, everything works correctly.  Thanks for your help, and let me know if there's anything else I can do to help if there's ever an effort made to support this sort of bad programming :)

Comment 14 Vikas Gorur 2010-07-02 17:08:26 UTC
Looking at code in xlators/storage/posix/src/posix.c, posix_create, I wonder if it might be a GlusterFS bug after all:

int32_t
posix_create (call_frame_t *frame, xlator_t *this,
              loc_t *loc, int32_t flags, mode_t mode,
              fd_t *fd)

...

        if (!flags) {
                _flags = O_CREAT | O_RDWR | O_EXCL;
        }
        else {
                _flags = flags | O_CREAT;
        }

As I understand it, the FORTRAN code is doing an open(2) without specifying O_EXCL, and hence even with the race should not see EEXIST.

In the code above the 'flags' argument should usually be set, but I'm wondering if the (!flags) path is being taken. Perhaps we are adding the O_EXCL flag when we shouldn't be doing that. 

Brian, if you can compile from source can you try changing the above line to:

       _flags = O_CREAT | O_RDWR;

and see if that fixes the bug?

Comment 15 Jeff Darcy 2010-07-02 17:19:52 UTC
(In reply to comment #11)
> I think you're right, its not a bug, but it does pose a compatibility issue. 
> If all other filesystems are handling this correctly, does it not make sense
> that Gluster should as well?  You agreed from your experience that there are
> numerous HPC-related codes that do this (which is bad), so maybe something like
> an RFE is more appropriate?

I'm not really sure other filesystems are handling this correctly, except to the extent that they don't create/allow the timing necessary for failure.  In fact, I've been able to reproduce this using a test program even on a local filesystem.  All it takes is enough iterations at a high enough rate.

The only GlusterFS enhancement I can think of that would help would be an option to suppress/retry EEXIST errors, but I wouldn't be very optimistic about that being implemented.  I'll defer to Vijay on that, though.  In any case I'm glad to see that touching the file before a run seems to provide a workaround at least some of the time.

Comment 16 Jeff Darcy 2010-07-02 17:27:06 UTC
(In reply to comment #14)
> As I understand it, the FORTRAN code is doing an open(2) without specifying
> O_EXCL, and hence even with the race should not see EEXIST.


My understanding is that "status='new'" in Fortran does imply that O_EXCL would already be present, and when I stepped through the code I'm pretty sure we were going through the second half of that if-statement.  If you look at server_create and server_create_resume (.../protocol/server/src/server-protocol.c) and the equivalent on the client side, it does look like we're passing flags we got from elsewhere.

Comment 17 Vikas Gorur 2010-07-02 17:59:44 UTC
(In reply to comment #16)

> My understanding is that "status='new'" in Fortran does imply that O_EXCL would
> already be present, and when I stepped through the code I'm pretty sure we were
> going through the second half of that if-statement.  If you look at
> server_create and server_create_resume
> (.../protocol/server/src/server-protocol.c) and the equivalent on the client
> side, it does look like we're passing flags we got from elsewhere.

Ok, if Fortran itself is passing O_EXCL then there is no way to avoid this race.

Comment 18 Brian Smith 2010-07-02 18:15:25 UTC
The Fortran code is actually calling OPEN with STATUS='UNKNOWN'.  This should trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails.  Some searching around indicates that STATUS='NEW' acts like O_CREAT|O_EXCL, so it looks like Jeff has this pretty much down.  I've gone ahead and modified my users' submission tools to include a touch at the beginning of every run on the files in question.  It would probably be worthwhile for me to submit a patch to the developers of my application to fix this issue.  Thanks for your help, guys!

Comment 19 Amar Tumballi 2010-10-05 06:43:14 UTC
3.0.5 was released some time back, and we will address this in either 3.0.6 or 3.1.1 (if it still exists in the 3.1.0 release).

Comment 20 Amar Tumballi 2010-10-05 09:38:35 UTC
(In reply to comment #18)
> The fortran code is actually calling OPEN with STATUS='UNKNOWN'.  This should
> trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails.  Some
> searching around indicates that STATUS='NEW' acts like O_CREAT|O_EXCL so it
> looks like Jeff has this pretty much down.  I've gone ahead and modified my
> user's submission tools to include a touch at the beginning of every run on the
> files in question.  It would probably be worthwhile for me to submit a patch to
> the developers of my application to fix this issue.  Thanks for your help guys!

Closing as per the above statement. The issue is with the create flags themselves.

