Bug 211931

Summary: nfsd needs to pass intent information to vfs_create() for GFS
Product: Red Hat Enterprise Linux 5 Reporter: Ben Marzinski <bmarzins>
Component: kernel Assignee: Steve Dickson <steved>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0 CC: aviro, dzickus, esandeen, jlayton, kanderso, staubach, steved
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2008-04-26 12:17:02 UTC Type: ---

Description Ben Marzinski 2006-10-23 22:03:51 UTC
Description of problem:

When NFS opens a file with O_CREAT, the kernel nfs daemon checks to see if the
file exists. If it does, nfsd does the *right thing* (either opens the file, or
if the file was opened with O_EXCL, returns an error).  If the file doesn't
exist, it passes the request down to the underlying file system to do the
create. Unfortunately, since nfs *knows* that the file doesn't exist, it doesn't 
bother to pass a nameidata structure, which would include the intent
information. However, with GFS or a similar cluster file system, the file could
have been created on another node after nfsd checks for it. If this is the case,
the underlying file system needs the intent information to do the *right thing*.
GFS only needs to check the flags variable of the open_intent structure, so
for GFS, a partially filled in nameidata structure would be fine. But without
that information, if GFS is trying to create a file that already exists, it
doesn't know if it should succeed or fail.
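
The sequence above can be sketched as a small user-space model. This is not kernel code: every name here (model_intent, model_file, model_fs_create, model_nfsd_excl_create) is hypothetical and only illustrates the check-then-create window. The model assumes the file system treats a missing intent as a non-exclusive create, which is the behavior reported later in this bug.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */

#define MODEL_EEXIST 17  /* stand-in for the errno value EEXIST */

/* Stand-in for the open intent a file system normally receives. */
struct model_intent {
    bool exclusive;   /* true when the caller asked for O_EXCL */
};

/* Stand-in for one name in a directory shared across cluster nodes. */
struct model_file {
    bool exists;
};

/*
 * File-system-side create. A NULL intent models the nfsd call path,
 * where no nameidata is passed down.
 * Returns 0 on success or -MODEL_EEXIST.
 */
int model_fs_create(struct model_file *f, const struct model_intent *intent)
{
    if (f->exists) {
        if (intent != NULL && intent->exclusive)
            return -MODEL_EEXIST;  /* exclusive create loses the race */
        return 0;                  /* no intent: treated as a plain open */
    }
    f->exists = true;
    return 0;
}

/*
 * nfsd-side exclusive create: check first, then create.
 * remote_create_in_window simulates another node creating the file
 * between the check and the create call.
 */
int model_nfsd_excl_create(struct model_file *f, bool pass_intent,
                           bool remote_create_in_window)
{
    struct model_intent intent = { .exclusive = true };

    if (f->exists)
        return -MODEL_EEXIST;      /* nfsd handles this case correctly */

    if (remote_create_in_window)
        f->exists = true;          /* another cluster node wins here */

    return model_fs_create(f, pass_intent ? &intent : NULL);
}
```

In this model, a remote create inside the window makes the exclusive create succeed when no intent is passed, and fail with -MODEL_EEXIST when the intent is passed.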

Version-Release number of selected component (if applicable):

kernel-2.6.18-1.2732.el5
This has been around since RHEL4 at least.

How reproducible:

always, given enough work.

Steps to Reproduce:
1. Set up an NFS export on top of a shared GFS file system from a machine in a cluster.
2. Have the NFS clients (not in the cluster) and the other cluster members
constantly create files with the same name using O_EXCL.
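
A minimal sketch of the per-attempt create that step 2 could run on each client and cluster node. The helper name try_exclusive_create is hypothetical. How the file is removed between rounds and how winners are compared across machines is not specified in this report, so that is left to the test harness.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try one exclusive create of `path`.
 * Returns 1 if this caller created the file, 0 if it already existed
 * (EEXIST), and -1 on any other error (errno is left set).
 */
int try_exclusive_create(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);

    if (fd >= 0) {
        close(fd);
        return 1;
    }
    return (errno == EEXIST) ? 0 : -1;
}
```

For a given name, at most one caller across all machines should ever see a return value of 1 until the file is removed again. Two callers seeing 1 for the same name is the failure described under "Actual results".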
  
Actual results:

Eventually, either you will crash or you will open an already created file.

Expected results:

You will never crash and you will never open an already created file.

Additional info:

Comment 1 Linda Wang 2006-12-04 17:28:42 UTC
Assigning this to Eric for evaluation.  It touches all filesystems,
so this is a very hairy one...  but do feel free to reassign to the appropriate
component owner.

Comment 2 RHEL Program Management 2007-04-25 23:13:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 3 Steve Dickson 2007-05-09 12:49:07 UTC
Per Linda's request, assigning to esandeen... 

Comment 5 Eric Sandeen 2007-05-09 14:59:07 UTC
That request was a long time ago ;)  I *think* this is better suited for the nfs
folks to fix, but if not, bounce it back please.  :)

Comment 6 Steve Dickson 2007-05-30 14:09:08 UTC
Personally I think it's a very slippery slope for
nfsd to be passing partially filled in nameidata 
structures... Something I'm not in favor of... 
 
Also I'm a bit surprised that GFS does not have some way to 
detect such races... This seems like a very common cluster
problem: racing threads trying to create the same
file... 

Finally, if the O_EXCL is set and the second create
loses the race, why is sending back a failure a problem?
The file truly does exist...

Comment 7 Ben Marzinski 2007-05-30 21:57:28 UTC
I'm not sure that I understand you. From GFS's point of view, there is no race.
GFS gets a create request, and notices that the file already exists.  For every
case except for NFS on top of GFS, GFS will get a nameidata structure that tells
it whether this create request is an O_EXCL request or not.  NFS doesn't pass
this information down, because it assumes that if the file doesn't exist when it
receives the create request, it will still not exist when NFS passes the request
down to the underlying file system. This will always be true for a single
machine filesystem.  There is no way to guarantee that this will be true on a
clustered filesystem.  Without the nameidata there is no way for GFS (or any
other filesystem) to know if the file was opened O_EXCL or not.

So if racing threads are trying to create the same file exclusively on multiple
nodes running GFS, it will always work fine. One, and only one, will succeed.
Put NFS on top of GFS, and now GFS has no way to know whether or not the create
requests were exclusive or not.  Currently, if GFS gets a create request, and
the file already exists, and there is no nameidata (this can only happen with
NFS) it just assumes that the create is not exclusive.  This means that
exclusively opening a file on NFS over GFS can break POSIX semantics.  It is
possible for two threads on two separate machines to both think that they
exclusively created this file. Unfortunately, this sort of operation is
sometimes used by applications to do locking, so breaking POSIX semantics here
can cause some fairly large problems.
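
For illustration, the lock-file idiom referred to above usually looks something like the following sketch. The function names lockfile_acquire and lockfile_release are hypothetical and not taken from any particular application. The idiom is only safe if exactly one caller can win the exclusive create, which is the guarantee that breaks here.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Take the lock by exclusively creating `lock_path`.
 * Returns 0 if the lock was taken, 1 if someone else holds it,
 * and -1 on any other error.
 */
int lockfile_acquire(const char *lock_path)
{
    int fd = open(lock_path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return (errno == EEXIST) ? 1 : -1;
    close(fd);
    return 0;
}

/* Drop the lock by removing the file. Returns 0 on success, -1 on error. */
int lockfile_release(const char *lock_path)
{
    return unlink(lock_path);
}
```

If two machines both get 0 back from lockfile_acquire for the same path, both enter the critical section at once.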

Comment 8 Steve Dickson 2007-05-31 13:02:47 UTC
Ok.. I understand... So GFS basically needs an "exclusive" bit so 
it can tell what to do wrt failing or succeeding an open(O_EXCL).

I'm still not a fan of passing down a bastardized nameidata structure. 
I think that's just going to cause problems with other filesystems.
Adding a bit to the mode field would be a bit hackish... and 
I can't see either one making it in upstream... 

What seems to be needed is a way to do a lookup that would populate the
nameidata structure without actually doing the lookup...
Unfortunately such an interface does not exist.


Comment 9 RHEL Program Management 2007-09-07 20:02:51 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 10 RHEL Program Management 2008-04-26 12:17:02 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request.