Description of problem:
When NFS opens a file with O_CREAT, the kernel nfs daemon checks to see if the file exists. If it does, nfsd does the *right thing* (either opens the file or, if the file was opened with O_EXCL, returns an error). If the file doesn't exist, it passes the request down to the underlying file system to do the create. Unfortunately, since nfs *knows* that the file doesn't exist, it doesn't bother to pass a nameidata structure, which would include the intent information. However, with gfs or a similar cluster file system, the file could have been created on another node after nfs checks for it. If this is the case, the underlying file system needs the intent information to do the *right thing*. GFS only needs to check the flags variable of the open_intent structure, so for GFS, a partially filled in nameidata structure would be fine. But without that information, if GFS is trying to create a file that already exists, it doesn't know if it should succeed or fail.

Version-Release number of selected component (if applicable):
kernel-2.6.18-1.2732.el5
This has been around since RHEL4 at least.

How reproducible:
Always, given enough work.

Steps to Reproduce:
1. Set up an nfs export on top of a shared GFS file system from a machine in a cluster.
2. Have the nfs clients (not in the cluster) and the other cluster members constantly create files with O_EXCL that have the same name.

Actual results:
Eventually, either you will crash or you will open an already created file.

Expected results:
You will never crash and you will never open an already created file.

Additional info:
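The reproduction steps above could be driven by something like the following. This is only a minimal sketch, not the test that was actually used: the helper names `exclusive_create` and `won_names` are made up for illustration, and the directory passed in is assumed to be the shared GFS mount (on cluster members) or the NFS mount of it (on the clients).

```python
import os


def exclusive_create(path):
    """Return True if this caller created `path`, False if it already existed."""
    try:
        # O_CREAT | O_EXCL must fail with EEXIST if the file is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def won_names(directory, names):
    """Try to exclusively create every name; return the ones this caller won.

    `directory` and `names` are hypothetical inputs: every participant is
    assumed to be given the same list of names in the same shared directory.
    """
    return [n for n in names if exclusive_create(os.path.join(directory, n))]
```

If every NFS client and cluster member runs `won_names` against the same directory with the same list of names, each name should appear in exactly one participant's result. A name that shows up in two results would be the "opened an already created file" case described under Actual results.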
Assigning this to Eric for evaluation. Because it touches all filesystems, this is a very hairy one... but do feel free to reassign to the appropriate component owner.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Per Linda's request, assigning to esandeen...
That request was a long time ago ;) I *think* this is better suited for the nfs folks to fix, but if not, bounce it back please. :)
Personally I think it's a very slippery slope for nfsd to be passing partially filled in nameidata structures... Something I'm not in favor of... Also I'm a bit surprised that GFS does not have some way to detect such races... Racing threads trying to create the same file seems like a very common cluster problem... Finally, if O_EXCL is set and the second create loses the race, why is sending back a failure a problem? The file truly does exist...
I'm not sure that I understand you. From GFS's point of view, there is no race. GFS gets a create request, and notices that the file already exists. For every case except for NFS on top of GFS, GFS will get a nameidata structure that tells it whether this create request is an O_EXCL request or not. NFS doesn't pass this information down, because it assumes that if the file doesn't exist when it receives the create request, it will still not exist when NFS passes the request down to the underlying file system. This will always be true for a single machine filesystem. There is no way to guarantee that this will be true on a clustered filesystem. Without the nameidata there is no way for GFS (or any other filesystem) to know if the file was opened O_EXCL or not. So if racing threads are trying to create the same file exclusively on multiple nodes running GFS, it will always work fine. One, and only one, will succeed. Put NFS on top of GFS, and now GFS has no way to know whether or not the create requests were exclusive. Currently, if GFS gets a create request, and the file already exists, and there is no nameidata (this can only happen with NFS), it just assumes that the create is not exclusive. This means that exclusively opening a file on NFS over GFS can break POSIX semantics. It is possible for two threads on two separate machines to both think that they exclusively created this file. Unfortunately, this sort of operation is sometimes used by applications to do locking, so breaking POSIX semantics here can cause some fairly large problems.
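To make the decision described above concrete, here is a toy model in Python. It is not GFS or nfsd code, only a sketch of the logic as stated in this comment: `create_on_cluster_fs` is a hypothetical name, and passing `None` for `open_flags` stands in for a create request that arrives without a nameidata structure.

```python
import errno
import os


def create_on_cluster_fs(already_exists, open_flags):
    """Toy model of a cluster filesystem handling a create request.

    already_exists: whether the file exists by the time the request arrives,
                    for example because another node created it.
    open_flags:     the open flags from the intent, or None when no
                    nameidata was passed down (the NFS case).
    """
    if not already_exists:
        return "created"
    if open_flags is None:
        # Intent unknown: the create is assumed to be non-exclusive,
        # so an O_EXCL request from an NFS client wrongly succeeds.
        return "opened-existing"
    if open_flags & os.O_EXCL:
        raise FileExistsError(errno.EEXIST, "File exists")
    return "opened-existing"
```

With the flags available, an exclusive create on an existing file fails with EEXIST as POSIX requires. With `None`, the same request returns success, which is how two callers can both believe they created the file.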
Ok.. I understand... So GFS basically needs an "exclusive" bit so it can tell whether to fail or succeed an open(O_EXCL). I'm still not a fan of passing down a bastardized nameidata structure. I think that's just going to cause problems with other filesystems. Adding a bit to the mode field would be a bit hackish... and I can't see either one making it in upstream... What seems to be needed is a way to populate the nameidata structure without actually doing the lookup... Unfortunately such an interface does not exist.
This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.