Red Hat Bugzilla – Bug 211931
nfsd needs to pass intent information to vfs_create() for GFS
Last modified: 2008-04-26 08:17:02 EDT
Description of problem:
When NFS opens a file with O_CREAT, the kernel nfs daemon checks to see if the
file exists. If it does, nfsd does the *right thing* (either opens the file, or
if the file was opened with O_EXCL, returns an error). If the file doesn't
exist, it passes the request down to the underlying file system to do the
create. Unfortunately, since nfs *knows* that the file doesn't exist, it doesn't
bother to pass a nameidata structure, which would include the intent
information. However with gfs or a similar cluster file system, the file could
have been created on another node after nfs checks for it. If this is the case,
the underlying file system needs the intent information to do the *right thing*.
GFS only needs to check the flags variable of the open_intent structure, so
for GFS, a partially filled in nameidata structure would be fine. But without
that information, if GFS is trying to create a file that already exists, it
doesn't know if it should succeed or fail.
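The window described above can be sketched as a small user-space model. This is only an illustration, not nfsd or GFS code: `ToyClusterFS`, `nfsd_create`, and the `between_check_and_create` hook are hypothetical names, and the "no intent means non-exclusive" branch is an assumed default chosen to make the ambiguity visible.

```python
import errno
import os


class ToyClusterFS:
    """Toy stand-in for a shared file system (hypothetical, not GFS code)."""

    def __init__(self):
        self.files = set()

    def lookup(self, name):
        return name in self.files

    def create(self, name, intent_flags=None):
        if name in self.files:
            # Only a caller that passed intent flags can get EEXIST here.
            # With no intent, the file system cannot see O_EXCL, so this
            # model treats the create as non-exclusive (an assumption).
            if intent_flags is not None and intent_flags & os.O_EXCL:
                raise FileExistsError(errno.EEXIST, "File exists", name)
            return "opened-existing"
        self.files.add(name)
        return "created"


def nfsd_create(fs, name, flags, between_check_and_create=None):
    """Model of the check-then-create sequence from the description."""
    if fs.lookup(name):
        if flags & os.O_EXCL:
            raise FileExistsError(errno.EEXIST, "File exists", name)
        return "opened-existing"
    if between_check_and_create is not None:
        between_check_and_create()  # another node acts inside the window
    return fs.create(name)  # no intent information is passed down


fs = ToyClusterFS()
flags = os.O_CREAT | os.O_EXCL


def other_node():
    # A cluster member creates the same file exclusively in the window.
    return fs.create("lock", intent_flags=flags)


result = nfsd_create(fs, "lock", flags, between_check_and_create=other_node)
print(result)
```

In this model the NFS-side exclusive create returns "opened-existing" instead of failing with EEXIST, even though the other node has already created the file.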
Version-Release number of selected component (if applicable):
This has been around since RHEL4 at least.
How reproducible:
Always, given enough work.
Steps to Reproduce:
1. Set up an NFS export on top of a shared GFS file system from a machine in a cluster.
2. Have the NFS clients (not in the cluster) and the other cluster members
constantly create files with O_EXCL that have the same name.
Actual results:
Eventually, either you will crash or you will open an already created file.
Expected results:
You will never crash and you will never open an already created file.
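A rough user-space sketch of the create loop from step 2 follows. It is not the original test program: the helper names, worker count, and round count are assumptions, and it runs against a local temporary directory so that it is self-contained. To exercise the bug, the directory would have to live on the NFS mount (for the clients) and on the GFS mount (for the cluster members), with the workers spread across machines.

```python
import os
import tempfile
import threading


def try_exclusive_create(path):
    """Return True if this caller created the file, False if it existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def race(directory, name="samefile", workers=8):
    """Start several threads that all try to create the same file at once."""
    path = os.path.join(directory, name)
    results = []
    results_lock = threading.Lock()
    barrier = threading.Barrier(workers)

    def worker():
        barrier.wait()  # line the threads up to widen the race
        won = try_exclusive_create(path)
        with results_lock:
            results.append(won)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    os.unlink(path)  # clean up so the next round starts fresh
    return sum(results)  # number of callers that believe they created it


with tempfile.TemporaryDirectory() as d:
    winners = [race(d) for _ in range(50)]

print(set(winners))
```

A count other than 1 in any round would mean that more than one caller believes it created the file exclusively, which is the failure described under the actual results.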
Assigning this to Eric for evaluation, because it touches all filesystems,
so this is a very hairy one... but do feel free to reassign to the appropriate
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
Per Linda's request, assigning to esandeen...
That request was a long time ago ;) I *think* this is better suited for the nfs
folks to fix, but if not, bounce it back please. :)
Personally I think it's a very slippery slope for
nfsd to be passing partially filled in nameidata
structures... Something I'm not in favor of...
Also I'm a bit surprised that GFS does not have some way to
detect such races... This seems like a very common cluster
problem: having racing threads trying to create the same file...
Finally, if the O_EXCL is set and the second create
loses the race, why is sending back a failure a problem?
The file truly does exist...
I'm not sure that I understand you. From GFS's point of view, there is no race.
GFS gets a create request, and notices that the file already exists. For every
case except for NFS on top of GFS, GFS will get a nameidata structure that tells
it whether this create request is an O_EXCL request or not. NFS doesn't pass
this information down, because it assumes that if the file doesn't exist when it
receives the create request, it will still not exist when NFS passes the request
down to the underlying file system. This will always be true for a single
machine filesystem. There is no way to guarantee that this will be true on a
clustered filesystem. Without the nameidata there is no way for GFS (or any
other filesystem) to know if the file was opened O_EXCL or not.
So if racing threads are trying to create the same file exclusively on multiple
nodes running GFS, it will always work fine. One, and only one, will succeed.
Put NFS on top of GFS, and now GFS has no way to know whether the create
requests were exclusive or not. Currently, if GFS gets a create request, and
the file already exists, and there is no nameidata (this can only happen with
NFS) it just assumes that the create is not exclusive. This means that
exclusively opening a file on NFS over GFS can break POSIX semantics. It is
possible for two threads on two separate machines to both think that they
exclusively created this file. Unfortunately, this sort of operation is
sometimes used by applications to do locking, so breaking POSIX semantics here
can cause some fairly large problems.
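For context, the locking idiom mentioned above usually looks something like the sketch below. The function names and the lock file name are hypothetical; the only thing the idiom relies on is that an O_CREAT|O_EXCL open succeeds for exactly one caller.

```python
import os
import tempfile


def acquire_lock(path):
    """Return True if we created the lock file, False if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the holder, purely informational
    return True


def release_lock(path):
    """Drop the lock by removing the lock file."""
    os.unlink(path)


with tempfile.TemporaryDirectory() as d:
    lock_path = os.path.join(d, "app.lock")
    first = acquire_lock(lock_path)   # lock is free, so this succeeds
    second = acquire_lock(lock_path)  # lock is held, so this must fail
    release_lock(lock_path)
    third = acquire_lock(lock_path)   # free again after release
```

If the second call could also return True, two holders would enter the critical section at the same time, which is why losing O_EXCL semantics matters for applications that lock this way.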
Ok.. I understand... So GFS basically needs an "exclusive" bit so
it can tell what to do wrt failing or succeeding an open(O_EXCL).
I'm still not a fan of passing down a bastardized nameidata structure.
I think that's just going to cause problems with other filesystems.
Adding a bit to the mode field would be a bit hackish... and
I can't see either one making it in upstream...
What seems to be needed is a way to do a lookup that would populate the
nameidata structure but not actually do the lookup...
Unfortunately such an interface does not exist.
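To make the "exclusive bit" idea concrete, here is a toy model in which the create call carries an explicit exclusivity flag. Everything here is hypothetical: `FlaggedClusterFS` and its `exclusive` parameter illustrate the information the file system needs and do not describe any existing kernel interface.

```python
import errno


class FlaggedClusterFS:
    """Toy shared file system whose create takes an explicit exclusive flag."""

    def __init__(self):
        self.files = set()

    def create(self, name, exclusive=False):
        if name in self.files:
            if exclusive:
                # The caller asked for O_EXCL semantics, so fail.
                raise FileExistsError(errno.EEXIST, "File exists", name)
            return "opened-existing"
        self.files.add(name)
        return "created"


fs = FlaggedClusterFS()
first = fs.create("lock", exclusive=True)  # another node wins the race

try:
    # nfsd's create arrives late, but still carries the exclusive flag.
    second = fs.create("lock", exclusive=True)
except FileExistsError as exc:
    second = exc.errno
```

With the flag passed down, the late create fails with EEXIST, so only one caller ever believes it created the file.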
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time. This request will be
reviewed for a future Red Hat Enterprise Linux release.
Development Management has reviewed and declined this request. You may appeal
this decision by reopening this request.