For some more background info: the bricks are both SunFire X4500s running CentOS 5.4 with 46x500GB drives configured as 23 RAID1 devices. LVM aggregates these into a VG, and a single large LV striped across these volumes provides 10TB of XFS space on each brick. XFS is mounted with inode64 and has the following configuration:

# xfs_db -r -c sb -c p /dev/mapper/vg1-pvfs
magicnum = 0x58465342
blocksize = 4096
dblocks = 2684368896
rblocks = 0
rextents = 0
uuid = 86c66713-e127-4033-92be-12d3fea43ef0
logstart = 2147483776
rootino = 2048
rbmino = 2049
rsumino = 2050
rextsize = 2816
agblocks = 83886592
agcount = 32
rbmblocks = 0
logblocks = 32768
versionnum = 0x3194
sectsize = 512
inodesize = 256
inopblock = 16
fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
blocklog = 12
sectlog = 9
inodelog = 8
inopblog = 4
agblklog = 27
rextslog = 0
inprogress = 0
imax_pct = 25
icount = 2793280
ifree = 1136
fdblocks = 1281924788
frextents = 0
uquotino = 0
gquotino = 0
qflags = 0
flags = 0
shared_vn = 0
inoalignmt = 2
unit = 128
width = 2816
dirblklog = 0
logsectlog = 0
logsectsize = 0
logsunit = 0
features2 = 0
> dirblklog = 0
> logsectlog = 0
> logsectsize = 0
> logsunit = 0
> features2 = 0

Thanks for the information, Brian. Could you please run with the "TRACE" log-level and send us the client and server log files? A full log attachment would be better.
Created attachment 242 [details]
just a screenshot of the original error
Attachments are added. There is a fair amount of traffic in them, so to give you a hint, you may want to focus on entries that include 'brs' in the path. Thanks, -Brian
Several of our FORTRAN-based applications experience issues while running on GlusterFS 3.0.3. On rare occasions the applications work, but most of the time they fail with some sort of file open/stat/overwrite error. An example is shown below while running VASP:

forrtl: File exists
forrtl: severe (10): cannot overwrite existing file, unit 18, file /work/b/brs/Si/CHGCAR

Corresponding debug log entries on my storage bricks show:

[2010-06-30 15:30:54] D [server-protocol.c:2104:server_create_cbk] server-tcp: create(/b/brs/Si/CHGCAR) inode (ptr=0x2aaab00e05b0, ino=2159011921, gen=5488651098262601749) found conflict (ptr=0x2aaab40cca00, ino=2159011921, gen=5488651098262601749)
[2010-06-30 15:30:54] D [server-resolve.c:386:resolve_entry_simple] server-tcp: inode (pointer: 0x2aaab40cca00 ino:2159011921) found for path (/b/brs/Si/CHGCAR) while type is RESOLVE_NOT
[2010-06-30 15:30:54] D [server-protocol.c:2132:server_create_cbk] server-tcp: 72: CREATE (null) (0) ==> -1 (File exists)

Debug logs on the clients show:

[2010-06-30 15:30:54] W [fuse-bridge.c:1719:fuse_create_cbk] glusterfs-fuse: 215: /b/brs/Si/CHGCAR => -1 (File exists)
[2010-06-30 15:30:54] D [client-protocol.c:4929:client_lookup_cbk] pvfs1-1: LOOKUP 4318019839/WAVECAR (/b/brs/Si/WAVECAR): inode number changed from {5488626372135878729,2159009924} to {5488651098262601750,2159011922}

Below are my vol files for the client and server side. (Don't mind the hostnames... they used to run PVFS2 :) On clients, I've disabled ALL performance translators.

================ glusterfs.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs

# TRANSPORT-TYPE tcp
volume pvfs0-1
  type protocol/client
  option transport-type tcp
  option remote-host pvfs0
  option remote-port 6996
  option transport.socket.nodelay on
  option remote-subvolume brick1
end-volume

volume pvfs1-1
  type protocol/client
  option transport-type tcp
  option remote-host pvfs1
  option remote-port 6996
  option transport.socket.nodelay on
  option remote-subvolume brick1
end-volume

volume distribute
  type cluster/distribute
  subvolumes pvfs0-1 pvfs1-1
end-volume

================ glusterfsd.vol ===============
## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --name work pvfs0:/pvfs/glusterfs pvfs1:/pvfs/glusterfs

volume posix1
  type storage/posix
  option directory /pvfs/glusterfs
end-volume

volume locks1
  type features/locks
  subvolumes posix1
  option mandatory-locks on
end-volume

volume brick1
  type performance/io-threads
  option thread-count 8
  subvolumes locks1
end-volume

volume server-tcp
  type protocol/server
  option transport-type tcp
  option auth.addr.brick1.allow *
  option transport.socket.listen-port 6996
  option transport.socket.nodelay on
  subvolumes brick1
end-volume

Thanks in advance for your attention.

-Brian Smith
(In reply to comment #6)
> Attachments are added. There is a fair amount of traffic in them, so to give
> you a hint, you may want to focus on entries that include 'brs' in the path.

More than what we need, thank you :-)
This looks very much like seven processes trying to open /b/brs/Si/CHGCAR for create simultaneously. There are seven failed LOOKUP calls all back to back, followed by seven CREATE calls, of which six (N-1) generate "found conflict" messages and fail.

From looking at the code, it seems like the "found conflict" message occurs when an inode is hashed but not yet linked to its parent, which should be a transient state. The other message ("...while type is RESOLVE_NOT") is what should happen when the file is fully created. It's possible that some condition is causing an inode to get into a persistent "hashed but not linked" state, but I think first we'd have to rule out the possibility that there are in fact seven open/create calls racing with one another.

If the lookup/create were treated as a single atomic operation there would be no race. Even as two operations, the behavior is very timing-dependent, which might explain why the problem does not surface on other filesystems such as PVFS2. In any case, if there are indeed seven lookup/open/create sequences, then the obvious workaround is to serialize them somehow.
Jeff,

Looking at the application source code, it looks like at the beginning of execution, CHGCAR and others are opened with STATUS='UNKNOWN'. These calls are NOT contained in any conditional that checks process rank. Given this is an MPI job, it looks as though all MPI processes that are spawned will attempt to go through this same procedure.

I need to go through more of the code and see if there are any reads/writes that are also taking place against the opened handles from all processes. I suspect that most of the file I/O that takes place on all processes is reads of initial inputs. If that's the case, I could work around this by calling open from only the master process, calling a barrier, then calling open on the child processes with STATUS='OLD'. Depending on the structure of the code, this can be a royal pain if OPENs are occurring like this in other places.

-Brian
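A minimal sketch of that master-creates-then-barrier pattern, written as MPI C rather than the application's Fortran; the path "CHGCAR" and the plain POSIX open() calls are illustrative only:

/* Sketch: only rank 0 creates the file; everyone else opens it after a barrier. */
#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rough equivalent of Fortran OPEN(..., STATUS='UNKNOWN'): create it once. */
        fd = open("CHGCAR", O_WRONLY | O_CREAT, 0644);
        if (fd >= 0)
            close(fd);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone waits until the file exists */

    /* All ranks now open the existing file (Fortran STATUS='OLD'). */
    fd = open("CHGCAR", O_RDWR);
    if (fd < 0)
        perror("open");
    else
        close(fd);

    MPI_Finalize();
    return 0;
}

In Fortran terms this is simply: rank 0 does the OPEN with STATUS='UNKNOWN', everyone calls MPI_BARRIER, then the remaining ranks OPEN with STATUS='OLD'.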
Yep, my last job was in HPC (SiCortex) and I did see an *awful* lot of code that did exactly this kind of thing. Unfortunately, there's not a whole lot the filesystem can do, because the "lookup, then create or open depending on the result" sequence is done within the kernel (do_filp_open in fs/namei.c).

One possible workaround, if there are too many applications to fix, would be to add a layer somewhere that automatically converts CREATEs which fail with EEXIST into OPENs. This could be done with a translator, with an LD_PRELOAD, or with some even more subtle kinds of dynamic-loader magic. In all cases it's an approach that requires extreme caution, because there are likely to be other cases where an open with O_EXCL (status='new' in FORTRAN) is used properly for synchronization/exclusion and the EEXIST must not be suppressed in this way.

Does it seem fair to you, based on what we've discovered, to say that GlusterFS is behaving correctly in this case (i.e. that the bug can be closed)? I'd be glad to help you implement a workaround, but AFAICT there's nothing to be done to the Gluster code as it is now.
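A rough sketch of the LD_PRELOAD variant of that workaround, assuming glibc's dlsym(RTLD_NEXT, "open"); it deliberately ignores the open64/openat/creat entry points and carries exactly the caveat above, since it also suppresses legitimate O_EXCL failures:

/* Interpose open(): if an O_CREAT|O_EXCL create fails with EEXIST,
 * retry as a plain open of the existing file.
 * WARNING: this breaks any code that relies on O_EXCL for mutual exclusion. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <errno.h>

int open(const char *path, int flags, ...)
{
    static int (*real_open)(const char *, int, ...);
    mode_t mode = 0;
    int fd;

    if (!real_open)
        real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = va_arg(ap, mode_t);
        va_end(ap);
    }

    fd = real_open(path, flags, mode);
    if (fd < 0 && errno == EEXIST && (flags & O_CREAT) && (flags & O_EXCL))
        fd = real_open(path, flags & ~(O_CREAT | O_EXCL), mode);

    return fd;
}

Something like "gcc -shared -fPIC -o eexist_shim.so eexist_shim.c -ldl" and "LD_PRELOAD=./eexist_shim.so" would inject it (the file name eexist_shim.c is just a placeholder); a translator that can be enabled per volume would be the more controlled place for the same logic.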
I think you're right, it's not a bug, but it does pose a compatibility issue. If all other filesystems are handling this correctly, does it not make sense that Gluster should as well? You agreed from your experience that there are numerous HPC-related codes that do this (which is bad), so maybe something like an RFE is more appropriate?
Perhaps "correctly" is to strong a word... "gracefully" would be better. It seems odd that this is allowed to work anywhere, but it does.
It appears that if I touch the expected output files to create them beforehand, everything works correctly. Thanks for your help, and let me know if there's anything else I can do to help if there's ever an effort made to support this sort of bad programming :)
Looking at the code in xlators/storage/posix/src/posix.c, posix_create, I wonder if it might be a GlusterFS bug after all:

int32_t
posix_create (call_frame_t *frame, xlator_t *this,
              loc_t *loc, int32_t flags, mode_t mode, fd_t *fd)
...
        if (!flags) {
                _flags = O_CREAT | O_RDWR | O_EXCL;
        } else {
                _flags = flags | O_CREAT;
        }

As I understand it, the FORTRAN code is doing an open(2) without specifying O_EXCL, and hence even with the race it should not see EEXIST. In the code above the 'flags' argument should usually be set, but I'm wondering if the (!flags) path is being taken. Perhaps we are adding the O_EXCL flag when we shouldn't be.

Brian, if you can compile from source, can you try changing the line above to:

        _flags = O_CREAT | O_RDWR;

and see if that fixes the bug?
(In reply to comment #11)
> I think you're right, it's not a bug, but it does pose a compatibility issue.
> If all other filesystems are handling this correctly, does it not make sense
> that Gluster should as well? You agreed from your experience that there are
> numerous HPC-related codes that do this (which is bad), so maybe something
> like an RFE is more appropriate?

I'm not really sure other filesystems are handling this correctly, except to the extent that they don't create/allow the timing necessary for failure. In fact, I've been able to reproduce this using a test program even on a local filesystem. All it takes is enough iterations at a high enough rate.

The only GlusterFS enhancement I can think of that would help would be an option to suppress/retry EEXIST errors, but I wouldn't be very optimistic about that being implemented. I'll defer to Vijay on that, though. In any case I'm glad to see that touching the file before a run seems to provide a workaround at least some of the time.
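Not the actual test program referenced above, but a small reproduction along the same lines, assuming plain POSIX calls; NPROC, ITERS, and /tmp/racefile are arbitrary:

/* Several processes each do the check-then-exclusive-create dance against
 * the same path.  Between one process's stat() and its O_EXCL create,
 * another process can win, producing EEXIST even on a local filesystem. */
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NPROC 7
#define ITERS 10000

static void worker(const char *path)
{
    for (int i = 0; i < ITERS; i++) {
        struct stat st;
        int fd;

        if (stat(path, &st) == 0)
            fd = open(path, O_RDWR);                            /* STATUS='OLD' */
        else
            fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0644);   /* STATUS='NEW' */

        if (fd < 0)
            perror("open");   /* EEXIST (or ENOENT) here is the race firing */
        else
            close(fd);

        unlink(path);         /* reset so the race can recur */
    }
}

int main(void)
{
    const char *path = "/tmp/racefile";

    for (int n = 0; n < NPROC; n++)
        if (fork() == 0) {
            worker(path);
            _exit(0);
        }

    for (int n = 0; n < NPROC; n++)
        wait(NULL);
    return 0;
}

On a fast local filesystem the stat/create window is tiny, so it can take many iterations before an error shows up; a networked filesystem widens that window considerably.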
(In reply to comment #14)
> As I understand it, the FORTRAN code is doing an open(2) without specifying
> O_EXCL, and hence even with the race it should not see EEXIST.

My understanding is that "status='new'" in Fortran does imply that O_EXCL would already be present, and when I stepped through the code I'm pretty sure we were going through the second half of that if-statement. If you look at server_create and server_create_resume (.../protocol/server/src/server-protocol.c) and the equivalent on the client side, it does look like we're passing flags we got from elsewhere.
(In reply to comment #16)
> My understanding is that "status='new'" in Fortran does imply that O_EXCL
> would already be present, and when I stepped through the code I'm pretty sure
> we were going through the second half of that if-statement. If you look at
> server_create and server_create_resume
> (.../protocol/server/src/server-protocol.c) and the equivalent on the client
> side, it does look like we're passing flags we got from elsewhere.

OK, if Fortran itself is passing O_EXCL then there is no way to avoid this race.
The Fortran code is actually calling OPEN with STATUS='UNKNOWN'. This should trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails. Some searching around indicates that STATUS='NEW' acts like O_CREAT|O_EXCL, so it looks like Jeff has this pretty much down. I've gone ahead and modified my user's submission tools to touch the files in question at the beginning of every run. It would probably be worthwhile for me to submit a patch to the developers of my application to fix this issue. Thanks for your help, guys!
3.0.5 was done some time back, and we will address this in either 3.0.6 or 3.1.1 (if it still exists in the 3.1.0 release).
(In reply to comment #18)
> The Fortran code is actually calling OPEN with STATUS='UNKNOWN'. This should
> trigger a call with STATUS='NEW' if an OPEN with STATUS='OLD' fails. Some
> searching around indicates that STATUS='NEW' acts like O_CREAT|O_EXCL, so it
> looks like Jeff has this pretty much down. I've gone ahead and modified my
> user's submission tools to touch the files in question at the beginning of
> every run. It would probably be worthwhile for me to submit a patch to the
> developers of my application to fix this issue. Thanks for your help, guys!

Closing as per the above statement; the issue is with the create flags themselves.