| Summary: | FORTRAN I/O exhibits odd behaviors | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Brian Smith <brs> |
| Component: | unclassified | Assignee: | Vijay Bellur <vbellur> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | high | CC: | amarts, fharshav, gluster-bugs, jdarcy, vikas |
| Version: | 3.0.3 | Fixed In Version: | |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | x86_64 | OS: | Linux |
| Whiteboard: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Description: Brian Smith, 2010-06-30 16:48:27 UTC
> dirblklog = 0
> logsectlog = 0
> logsectsize = 0
> logsunit = 0
> features2 = 0
Thanks for the information, Brian. Could you please run at the "TRACE" log level and send us the client and server log files? Attaching the full logs would be better.
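(For reference, on 3.0.x the log level can typically be raised on the command line when starting the client and server processes; the volfile paths and mount point below are illustrative, not taken from this report:)
# client
$ glusterfs --log-level=TRACE --volfile=/etc/glusterfs/glusterfs.vol /work
# server
$ glusterfsd --log-level=TRACE --volfile=/etc/glusterfs/glusterfsd.vol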
Created attachment 242 [details]
just a screenshot of the original error
Created attachment 243 [details]
Created attachment 244 [details]
Attachments are added. There is a fair amount of traffic in them, so to give you a hint, you may want to focus on entries that include 'brs' in the path. Thanks, -Brian

Several of our FORTRAN-based applications experience issues while running on GlusterFS 3.0.3. On rare occasions the applications work, but most of the time they fail with some sort of file open/stat/overwrite error. An example is shown below, from a VASP run:
forrtl: File exists
forrtl: severe (10): cannot overwrite existing file, unit 18, file /work/b/brs/Si/CHGCAR
Corresponding debug log entries on my storage bricks show:
[2010-06-30 15:30:54] D [server-protocol.c:2104:server_create_cbk] server-tcp: create(/b/brs/Si/CHGCAR) inode (ptr=0x2aaab00e05b0, ino=2159011921, gen=5488651098262601749) found conflict (ptr=0x2aaab40cca00, ino=2159011921, gen=5488651098262601749)
[2010-06-30 15:30:54] D [server-resolve.c:386:resolve_entry_simple] server-tcp: inode (pointer: 0x2aaab40cca00 ino:2159011921) found for path (/b/brs/Si/CHGCAR) while type is RESOLVE_NOT
[2010-06-30 15:30:54] D [server-protocol.c:2132:server_create_cbk] server-tcp: 72: CREATE (null) (0) ==> -1 (File exists)
Debug logs on clients show:
[2010-06-30 15:30:54] W [fuse-bridge.c:1719:fuse_create_cbk] glusterfs-fuse: 215: /b/brs/Si/CHGCAR => -1 (File exists)
[2010-06-30 15:30:54] D [client-protocol.c:4929:client_lookup_cbk] pvfs1-1: LOOKUP 4318019839/WAVECAR (/b/brs/Si/WAVECAR): inode number changed from {5488626372135878729,2159009924} to {5488651098262601750,2159011922}
Below are my vol files for client and server side. (Don't mind the hostnames... they used to run PVFS2 :) On clients, I've disabled ALL performance translators.
================ glusterfs.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs
# TRANSPORT-TYPE tcp
volume pvfs0-1
type protocol/client
option transport-type tcp
option remote-host pvfs0
option remote-port 6996
option transport.socket.nodelay on
option remote-subvolume brick1
end-volume
volume pvfs1-1
type protocol/client
option transport-type tcp
option remote-host pvfs1
option remote-port 6996
option transport.socket.nodelay on
option remote-subvolume brick1
end-volume
volume distribute
type cluster/distribute
subvolumes pvfs0-1 pvfs1-1
end-volume
================ glusterfsd.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs
volume posix1
type storage/posix
option directory /pvfs/glusterfs
end-volume
volume locks1
type features/locks
subvolumes posix1
option mandatory-locks on
end-volume
volume brick1
type performance/io-threads
option thread-count 8
subvolumes locks1
end-volume
volume server-tcp
type protocol/server
option transport-type tcp
option auth.addr.brick1.allow *
option transport.socket.listen-port 6996
option transport.socket.nodelay on
subvolumes brick1
end-volume
Thanks in advance for your attention.
-Brian Smith
(In reply to comment #6)
> Attachments are added. There is a fair amount of traffic in them, so to give
> you a hint, you may want to focus on entries that include 'brs' in the path.

More than what we need, thank you :-)

This looks very much like seven processes trying to open /b/brs/Si/CHGCAR for create simultaneously. There are seven failed LOOKUP calls back to back, followed by seven CREATE calls, of which six (N-1) generate "found conflict" messages and fail.

From looking at the code, it seems the "found conflict" message occurs when an inode is hashed but not yet linked to its parent, which should be a transient state. The other message ("...while type is RESOLVE_NOT") is what should happen once the file is fully created. It's possible that something is causing an inode to get stuck in a persistent "hashed but not linked" state, but first we'd have to rule out the possibility that there are in fact seven open/create calls racing with one another.
If the lookup and create were treated as a single atomic operation, there would be no race. Even as two operations, the behavior is very timing-dependent, which might explain why the problem does not surface on other filesystems such as PVFS2. In any case, if there are indeed seven lookup/open/create sequences, then the obvious workaround is to serialize them somehow.
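For illustration, a minimal reproducer of this kind of race might look like the sketch below. This is not from the bug report; the file name, process count, and open logic are assumptions that mirror Fortran's STATUS='UNKNOWN' behavior (try to open an existing file, and create it exclusively if that fails):

/* race_create.c -- illustrative sketch: N processes race the same
 * lookup-then-create sequence.  Run it in a loop to widen the odds;
 * the losers of the race fail with EEXIST. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 7  /* matches the seven racing opens seen in the logs */

static int open_unknown(const char *path)
{
    int fd = open(path, O_RDWR);             /* STATUS='OLD' attempt   */
    if (fd < 0 && errno == ENOENT)           /* fall back to 'NEW'     */
        fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0644);
    return fd;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "CHGCAR.test"; /* hypothetical */

    unlink(path);
    for (int i = 0; i < NPROC; i++) {
        if (fork() == 0) {
            int fd = open_unknown(path);
            if (fd < 0)
                perror("open");              /* EEXIST here is the race */
            exit(fd < 0);
        }
    }
    for (int i = 0; i < NPROC; i++)
        wait(NULL);
    return 0;
}

As noted later in this thread, enough iterations of exactly this pattern can fail even on a local filesystem; a networked filesystem simply widens the window between the failed lookup and the exclusive create.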
Jeff,

Looking at the application source code, it looks like at the beginning of execution, CHGCAR and others are opened with STATUS='UNKNOWN'. These calls are NOT contained in any conditional that checks process rank. Given that this is an MPI job, it looks as though every MPI process spawned will attempt to go through this same procedure. I need to go through more of the code and see whether any reads/writes are also taking place against the opened handles from all processes; I suspect that most of the file I/O on all processes consists of reads of the initial inputs. If that's the case, I could work around this by calling OPEN from only the master process, calling a barrier, and then calling OPEN on the child processes with STATUS='OLD' (see the sketch after the code excerpt below). Depending on the structure of the code, this could be a royal pain if OPENs occur like this in other places.

-Brian

Yep, my last job was in HPC (SiCortex), and I saw an *awful* lot of code that did exactly this kind of thing. Unfortunately, there's not a whole lot the filesystem can do, because the "lookup, then create or open depending on the result" sequence is done within the kernel (do_filp_open in fs/namei.c). One possible workaround, if there are too many applications to fix, would be to add a layer somewhere that automatically converts CREATEs which fail with EEXIST into OPENs. This could be done with a translator, with an LD_PRELOAD, or with some even more subtle kinds of dynamic-loader magic (a sketch of the LD_PRELOAD approach appears at the end of this report). In all cases it is an approach that requires extreme caution, because there are likely to be other cases where an open with O_EXCL (STATUS='NEW' in FORTRAN) is used properly for synchronization/exclusion, and there the EEXIST must not be suppressed.

Does it seem fair to you, based on what we've discovered, to say that GlusterFS is behaving correctly in this case (i.e., that the bug can be closed)? I'd be glad to help you implement a workaround, but AFAICT there's nothing to be done to the GlusterFS code as it stands.

I think you're right, it's not a bug, but it does pose a compatibility issue. If all other filesystems are handling this correctly, does it not make sense that GlusterFS should as well? You agreed from your experience that there are numerous HPC-related codes that do this (which is bad), so maybe something like an RFE is more appropriate? Perhaps "correctly" is too strong a word; "gracefully" would be better. It seems odd that this is allowed to work anywhere, but it does. It appears that if I touch the expected output files to create them beforehand, everything works correctly. Thanks for your help, and let me know if there's anything else I can do if there's ever an effort made to support this sort of bad programming :)

Looking at the code in xlators/storage/posix/src/posix.c, posix_create, I wonder if it might be a GlusterFS bug after all:
int32_t
posix_create (call_frame_t *frame, xlator_t *this,
              loc_t *loc, int32_t flags, mode_t mode,
              fd_t *fd)
...
        if (!flags) {
                _flags = O_CREAT | O_RDWR | O_EXCL;
        } else {
                _flags = flags | O_CREAT;
        }
As I understand it, the FORTRAN code is doing an open(2) without specifying O_EXCL, and hence even with the race should not see EEXIST.
In the code above, the 'flags' argument should usually be set, but I'm wondering if the (!flags) path is being taken; perhaps we are adding the O_EXCL flag when we shouldn't be.
Brian, if you can compile from source, can you try changing the above line to:
_flags = O_CREAT | O_RDWR;
and see if that fixes the bug?
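For reference, the serialization Brian proposed earlier (the master creates the file, everyone else opens it after a barrier) might look roughly like this in C with MPI. This is an illustrative sketch only; the real application is Fortran and would use OPEN with STATUS='OLD' after the barrier, and the path is just the one from the logs:

/* Illustrative "master creates, others open" workaround sketch. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, fd;
    const char *path = "/work/b/brs/Si/CHGCAR";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Only the master creates, so nothing races the create. */
        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd >= 0) close(fd);
    }
    MPI_Barrier(MPI_COMM_WORLD);   /* file now exists for everyone */

    fd = open(path, O_RDWR);       /* STATUS='OLD' equivalent */
    if (fd < 0) perror(path);
    /* ... reads/writes ... */
    if (fd >= 0) close(fd);

    MPI_Finalize();
    return 0;
}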
(In reply to comment #11)
> I think you're right, it's not a bug, but it does pose a compatibility issue.
> If all other filesystems are handling this correctly, does it not make sense
> that Gluster should as well? You agreed from your experience that there are
> numerous HPC-related codes that do this (which is bad), so maybe something
> like an RFE is more appropriate?

I'm not really sure other filesystems are handling this correctly, except to the extent that they don't create/allow the timing necessary for failure. In fact, I've been able to reproduce this using a test program even on a local filesystem; all it takes is enough iterations at a high enough rate. The only GlusterFS enhancement I can think of that would help would be an option to suppress/retry EEXIST errors, but I wouldn't be very optimistic about that being implemented. I'll defer to Vijay on that, though. In any case, I'm glad to see that touching the file before a run seems to provide a workaround at least some of the time.

(In reply to comment #14)
> As I understand it, the FORTRAN code is doing an open(2) without specifying
> O_EXCL, and hence even with the race should not see EEXIST.

My understanding is that "status='new'" in Fortran does imply that O_EXCL would already be present, and when I stepped through the code I'm pretty sure we were going through the second half of that if-statement. If you look at server_create and server_create_resume (.../protocol/server/src/server-protocol.c) and the equivalent on the client side, it does look like we're passing flags we got from elsewhere.

(In reply to comment #16)
> My understanding is that "status='new'" in Fortran does imply that O_EXCL
> would already be present, and when I stepped through the code I'm pretty sure
> we were going through the second half of that if-statement.

OK, if the Fortran runtime itself is passing O_EXCL, then there is no way to avoid this race.

The Fortran code is actually calling OPEN with STATUS='UNKNOWN'. This should trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails. Some searching around indicates that STATUS='NEW' acts like O_CREAT|O_EXCL, so it looks like Jeff has this pretty much down. I've gone ahead and modified my user's submission tools to include a touch, at the beginning of every run, on the files in question. It would probably be worthwhile for me to submit a patch to the developers of my application to fix this issue. Thanks for your help, guys!

3.0.5 was done some time back; we will address this in either 3.0.6 or 3.1.1 (if it still exists in the 3.1.0 release).

(In reply to comment #18)
> The Fortran code is actually calling OPEN with STATUS='UNKNOWN'. This should
> trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails. [...]
> I've gone ahead and modified my user's submission tools to include a touch,
> at the beginning of every run, on the files in question.

Closing as per the above statement: the issue is with the create flags themselves.
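As a postscript, the LD_PRELOAD workaround floated earlier (convert CREATEs that fail with EEXIST into plain OPENs) might look roughly like the sketch below. This is an assumption-laden illustration, not anything shipped with GlusterFS, and as cautioned above it deliberately defeats O_EXCL, so it would break any program that uses status='new' for genuine mutual exclusion:

/* eexist_shim.c -- illustrative LD_PRELOAD sketch: if an exclusive
 * create fails with EEXIST, retry as a plain open.  DANGEROUS for any
 * program that relies on O_EXCL for synchronization.
 *
 * Build (hypothetical): gcc -shared -fPIC -o eexist_shim.so eexist_shim.c -ldl
 * Use:                  LD_PRELOAD=./eexist_shim.so ./a.out
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>

int open(const char *path, int flags, ...)
{
    /* Lazily resolved pointer to the real open(); not thread-safe,
     * which is fine for a sketch. */
    static int (*real_open)(const char *, int, ...);
    mode_t mode = 0;

    if (!real_open)
        real_open = (int (*)(const char *, int, ...))
                    dlsym(RTLD_NEXT, "open");

    if (flags & O_CREAT) {              /* mode only present with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    int fd = real_open(path, flags, mode);
    if (fd < 0 && errno == EEXIST && (flags & O_CREAT) && (flags & O_EXCL))
        fd = real_open(path, flags & ~(O_CREAT | O_EXCL), mode); /* retry */
    return fd;
}

A real shim would also need to interpose open64() and related entry points, since Fortran runtimes often call those rather than open().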