Red Hat Bugzilla – Bug 178057
Permission denied errors with NFS mounts to GFS when moving virtual IP's between cluster nodes
Last modified: 2007-11-30 17:07:22 EST
Created attachment 123308 [details]
Script to run on NFS client to generate I/O
Description of problem: With an NFS client mounting a GFS file system over a
virtual IP managed by the cluster manager, moving the virtual IP between nodes
while running I/O to the NFS mount can cause permission denied errors on the
NFS client.
Version-Release number of selected component (if applicable):
How reproducible:
Frequently, but not every time
Steps to Reproduce:
1. Create a GFS cluster with an IP service managed by the cluster manager
2. Mount a GFS file system from an NFS client using the IP
3. Run I/O on the NFS client to the NFS mounted file system
4. Use clusvcadm to move the IP service between nodes repeatedly
Actual results:
Permission denied errors on the NFS client
Expected results:
No I/O errors on the NFS client
The NFS mount uses default parameters, i.e., mount -t nfs
The attached scripts were used to generate I/O and to move the virtual IP.
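The relocation loop in step 4 can be sketched as shell along these lines. The service and node names below are placeholders, not this cluster's actual configuration, and the clusvcadm calls are only echoed (dry run) so the logic is visible without a live cluster; a real run would execute them with a sleep between moves.

```shell
# Dry-run sketch of the vip relocation loop (step 4 above).
# SERVICE and the node names are hypothetical placeholders.
SERVICE=nfs-ip
CMDS=""
for NODE in node1 node2 ; do
    # A live run would execute:  clusvcadm -r "$SERVICE" -m "$NODE" ; sleep 30
    CMDS="$CMDS clusvcadm -r $SERVICE -m $NODE;"
done
echo "$CMDS"
```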
In the test3 script, commenting out the rm of the file improves the behavior,
but the permission denied errors still eventually occur. Running these scripts
repeatedly causes various symptoms. We have seen cases where the file exists
on the GFS node but not on the NFS client: if you try to remove the file
on the NFS client, it is not there, and if you then try to copy the file from
a local file system to the NFS client, you get permission denied errors.
The results are often indeterminate.
Created attachment 123309 [details]
Script to run on GFS node to move virtual IP between nodes
Created attachment 123310 [details]
Text file to copy from local file system to NFS mounted file system
Created attachment 123313 [details]
Cluster.conf used for GFS cluster
These four attachments are the actual files used to create the test scenario in
which these problems were observed. The NFS mount used all default options,
i.e., mount -t nfs 10.250.5.21:/export/hhtest /hhtest.
Created attachment 123402 [details]
ls -l output for node 1
Created attachment 123403 [details]
ls -l output for node2
Repeated the test with the vip moving between nodes every 30 seconds. The test
ran for over 15 minutes and completed 626 iterations before failing with a
permission denied error on the client. The ls -l output on node 1 and node 2
differed (see attachments in comments 5 and 6). The vip had moved to node 2 at
the time of failure. With the NFS client still mounted through the vip to
node 2, the test file (vfsbug.txt) does not show up in ls -l from the client.
One other observation: at the time of failure, an ethereal trace showed that
RPC retransmits were occurring for a period of about 30 seconds. This started
on the last of 5 NFS writes, which eventually completed with an access error.
This trace is available on request.
I've recreated the environment now, but not the problem yet.
I've run the scenario for about an hour without failure now.
My test systems probably have a newer (unreleased) version of the cluster code.
In the past, I have noticed very similar strange attributes reported by ls when
I was running a slightly older version of the cluster suite code. This might
indicate that the problem was recently fixed, but that's not guaranteed. In the
meantime, I'll continue to run the provided scripts overnight to try to
recreate the problem.
If the test runs overnight, I will start backing out code fixes to see if the
problem reappears. Of course, if the problem recreates, I will debug it.
Also, I'd like to know what version of the cluster rpms and kernel the failing
systems are running.
My tests ran overnight without failure. So that is approximately 16 hours of
running the scripts, and more than 2.7 million successful diffs while moving the
ip service every ten seconds. My cluster.conf and scripts were identical to the
ones attached to this bz except for the following: (1) I am using a three-node
cluster instead of two, but I am only switching the ip service between two of
the nodes. (2) My cluster.conf specifies my cluster id, node names, IP
addresses and fencing specifics. (3) My cluster.conf does not specify the
additional dev2-mgmt, snapshot, or email_notifier services. Obviously, my
hardware is different too, so there are potential timing issues.
We are running the 2.6.9-22 kernel with the following rpms:
I reverted a bunch of the critical gfs code on my cluster back to
gfs-kernel-2.6.9-45 based on comment #10 above. Tests ran for an hour and a
half without failing.
I looked at all the gfs, cman and rgmanager code changes made since
gfs-kernel-2.6.9-45 but didn't see anything else that I thought would affect it
except for the gfs changes I had previously reverted.
Monday, I am going to start over with a different cluster (2-node) running the
2.6.9-22.0.1.EL kernel, and I'll start trying to reproduce the problem from
there. If that fails, I'll begin upgrading it and see if the problem goes away
with more recent code fixes.
I now have a two-node cluster with the failing system's RPMs as listed above. I
ran the scenario using the customer's scripts and data. This scenario ran
overnight, some odd 20 hours, with three and a half million diffs, all without
failure. The only difference: I'm using 32-bit i686 machines, not 64-bit x86_64.
I'm beginning to think this problem is either (a) something specific to the
x86_64 architecture, like perhaps a compiler bug, or (b) extremely timing
dependent. My i686 test machines are quite a bit slower.
I'm trying to find two x86_64 machines I can set up in a cluster and run the
scenario on.
This bug was originally seen as a slightly different problem. The test scripts
were run to try to understand better what was happening. I now have a network
trace for the original problem which is that with I/O running over NFS to a
GFS file system on one node, we can fail the node, cause it to get fenced, and
when the node reboots and the virtual IP fails back to the original node, we
get a permission denied error. The network trace shows that this is due to an
AUTH_ERROR at the RPC layer. In the attached network trace (in following
comment) at the original node after it has been fenced and rebooted, packet
number 106 shows the arp when the vip (10.251.253.14) moves back to the
original node. In packet number 133, the WRITE REPLY shows the AUTH_ERROR at
the RPC layer. I believe that this RPC AUTH_ERROR is the same error we have
seen running the test scripts previously attached. We have noticed that this
problem either occurs rather consistently on some days and other days it does
not occur at all. We are suspicious that it may be a network problem.
Had problems trying to attach the ethereal trace file. Please let me know if
you want to see it, and I'll try to email it to you.
There are apparently two failing scenarios: (1) The "original problem," which
dealt with a server that went down, was fenced, and came back, and (2) Henry
Harris's failing scenario that he can recreate and I can't. For scenario (1) we
have the ethereal trace. For (2) we have the provided scripts. I still haven't
been able to recreate the failure using my machines. The two scenarios might be
the same problem; that's an unproven theory at this point.
Regarding (1): I got the ethereal trace via email and analyzed it, comparing it
to my own ethereal traces of my non-failing attempts to recreate. I also sent
it to an NFS guy, Steve Dickson, to get his input. As indicated in comment #13,
the trace shows an RPC authorization code error, AUTH_ERROR.
Steve's theory (paraphrased by me) is that function svcauth_unix_set_client() is
failing when called from svc_process() via progp->pg_authenticate(). The
pg_authenticate() hook calls svcauth_unix_set_client(), which looks in the
cache of IP addresses to see if the given one exists. If it does not, which is
probably the case here, it returns SVC_DENIED, which is turned into an
rpc_autherr_badcred, which is what we see in the ethereal trace.
This has to do with the kernel code in net/sunrpc/auth_unix.c, net/sunrpc/svc.c
and other kernel code in net/sunrpc/.
The area of kernel code having to do with RPC's authentication of unix
credentials has been changed extensively since the RHEL4 version the customer is
using. It looks like many of the changes have been cross-written into recent
FC4 kernels but not into the RHEL 2.6.9-27 kernel. It's hard to say whether
cross-writing the changes would make the problem better or worse. However, it
would sure be interesting to know. If I could recreate the failure, I could
experiment with this and figure it out. I'm in the process of checking newer
RHEL4 kernels to see if the rpc code is there.
As for IP service relocation: After you cut through the complexities of
rgmanager and how it does its job, what really happens on the servers is
/sbin/ip -f inet addr del dev eth1 <virtual ip> on the old server
/sbin/ip -f inet addr add dev eth1 <virtual ip> on the new server
So perhaps the old server's credentials are hanging around in the rpc cache.
If Steve's theory holds true, the problem might be solvable by somehow getting
rgmanager to flush the credential authentication cache after it does the "ip
addr del" on the old server, and before doing the add on the new server when
relocating the service. As Steve pointed out, that's not necessarily an easy
thing to do.
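A minimal sketch of that idea, assuming the sunrpc ip-auth cache's flush interface (/proc/net/rpc/auth.unix.ip/flush, present on 2.6 kernels with sunrpc loaded). Whether flushing at this point in the relocation is sufficient, or even the right cache, is exactly what's unproven here.

```shell
# Hypothetical step for the ip resource agent, run after "ip addr del"
# on the old server: expire all cached unix-auth entries by writing
# the current time (epoch seconds) to the cache's flush file.
FLUSH=/proc/net/rpc/auth.unix.ip/flush
STAMP=$(date +%s)
if [ -w "$FLUSH" ] ; then
    echo "$STAMP" > "$FLUSH"
else
    echo "no sunrpc auth cache to flush on this host"
fi
```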
Since the kernel rpc authentication code has been modified recently, it's worth
trying to figure out if the changes fix the customer problem (or make it worse).
Regarding (2): I have been assigned two x86_64 machines (Intel, not Opteron)
here in the lab with shared storage. I just need to put the appropriate RPMs on
and set them up in a cluster. Hopefully, that will allow me to recreate the
failing scenario in the lab. I probably won't get this done until Monday.
I'm also going to investigate whether the /sbin/ip tool does anything regarding
RPC and its authentication cache (or whether it's supposed to).
I also question whether this bz has anything to do with bz 137057 or 166458.
I'm going to look at the evidence for those bugs and see if there is any
correlation. Solving this may also solve the mystery of my previous bz 174638.
I'd like to know a few things if possible:
1. I'd like to see the /etc/exports file to see what parameters are being given
for the exports in question.
2. I'd like to know whether the file I/O is being done as root or as some other
user.
3. I'd like to know if the two NFS servers have pretty close or pretty distant
time stamps. In other words, are they time-synched at all?
(I noticed that some of the RPC cache cleanup code checks time values, and
keeping the timestamps close or distant may affect whether the problem occurs.)
Running a time-changing command like date or rdate -s might affect the problem.
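For question 3, a quick way to eyeball the skew is to compare epoch seconds from the two servers. In the sketch below both readings are taken locally just to show the arithmetic; on real systems you would run date +%s on each NFS server (e.g. over ssh) and diff the results.

```shell
# Compare two clock readings and report the skew in seconds.
# In practice T1/T2 would come from the two NFS servers.
T1=$(date +%s)
T2=$(date +%s)
SKEW=$((T2 - T1))
echo "skew: $SKEW seconds"
```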
Now I've got a two-node cluster consisting of two x86_64 (Intel, not Opteron)
nodes with the same RPMS and kernel as documented by the customer above.
I ran their service_relocate.sh script overnight, relocating the virtual IP
every ten seconds. At the same time, I ran the nfs_client_io.sh script on both
an i686 NFS client (running RHEL4) and an x86_64 NFS client (running FC4). The
i686 client did 3.3 million diffs successfully. The x86_64 client did 2.5
million diffs successfully.
While all this was going on, I made sure the GFS filesystem had lots of I/O
happening. One of the NFS servers was continually untarring a big (235MB)
tarball of directory /usr/lib64:
for i in `seq 1 1000` ; do echo $i ; tar -xzf lib64.tgz ; done
(It did this successfully 442 times).
At the same time, the other node (the node also doing the service_relocate
script) was kept busy doing continuous recursive filesystem ls commands:
for ((a=1; a>0; a++)) ; do echo $a ; ls -alR | wc -l ; done
(It did this successfully more than 12000 times. FYI: The wc reported between
20129 and 20131 lines of ls output, depending on what the other tar process was
doing at the time.)
This was all done in hope of reproducing either this bug or other GFS corruption
bugs such as bz 175589, bz 176352 or even bz 179069.
Unfortunately for this bug (but fortunately for GFS in general), all of this ran
successfully with absolutely no evidence of filesystem corruption.
I'm planning on mixing things up a bit and trying more I/O variations.
I *may* have been able to recreate the failure; at any rate, I created *a* failure.
I was running a variant of the script that rm'd and cp'd the file
before running the diff. My script now reports only every 100 diffs,
thus increasing the pressure on the filesystem.
Running the script almost immediately gave me an "Access Denied" error,
although it might not be the same one the customer saw.
The nfs client got:
[root@roth-03 /]# /tmp/nfs_client_io6.sh
diff: /mnt/bob/vfsbug.txt: Permission denied
Diff failed 1
The nfs client started getting (in /var/log/messages) lots of:
Feb 1 08:06:24 roth-03 kernel: NFS: Buggy server - nlink == 0!
Feb 1 08:06:24 roth-03 kernel: __nfs_fhget: iget failed
nfs_refresh_inode: inode number mismatch
expected (0xa/0x361dbd), got (0xa/0x30faff)
Upon closer examination, the file got copied, but my NFS server node
(the one that had the virtual IP at the moment) showed the nlink count as 0!
[root@roth-02 /]# ls -l /mnt/bob3
-rw-r--r-- 0 root root 155806 Feb 1 08:00 vfsbug.txt
But the other node (that didn't have the vip at the time) said:
-rw-r--r-- 1 root root 155806 Feb 1 08:00 vfsbug.txt
The node with the vip wouldn't let me umount the gfs mount, despite
my having run exportfs -u on everything and gotten out of all dirs.
(No users of the mount point were found in lsof or fuser.)
I was finally forced to reboot the node, at which point it
kernel-panicked, exhibiting the (known) lock_dlm problem.
I power-cycled the virtual ip node via the power switch.
When it came back, after remounting the nfs share again on the nfs
client, an ls -l showed 0 files, but on both servers, all the files
appeared just fine!
I decided to gfs_fsck the fs. The filesystem was fixed just fine
(sorry, no bz 179069) and I continued to experiment.
Next, I tried to recreate the failure by just doing the diffs, with
no rm/cp on the nfs client, and it didn't recreate, for the short
time I ran it.
Then I tried to recreate the failure by running the failing script
(with the rm/cp) on my i686 nfs client to two i686 servers with the
failing rpms. Again, it did not recreate for the short time I ran.
Next, I tried to recreate the failure by running the failing scenario
on my i686 client to the x86_64 servers where I saw the failure earlier.
This is WITH the rm/cp commands in the script.
So in short: This didn't happen on my slower i686 machines. It only
happened when using faster x86_64 machines for both client and servers.
Incidentally, I am using: mount -o timeo=10,retrans=10
in order to avoid dozens of dmesgs that say:
nfs: server 10.15.84.248 not responding, still trying
nfs: server 10.15.84.248 OK
I was using those settings when the failure occurred.
I was able to recreate the same error a second time using the same
scenario and that is definitely good news.
My plans are to: (1) Figure out if this is a known/fixed problem
with newer rpms by upgrading my systems with newer rpms.
(2) Backtrack the problem to its source to figure out if this is a
problem with gfs or nfs (or even vfs).
This looks like the same failure as reported. We got the permission denied
error with the server that had the VIP at the time of failure showing an nlink
count of 0. Usually it also showed all ?'s for the file owner, size,
permissions, etc.
I updated my x86_64 test systems with the latest RHEL4 U3 RPMs for
RHCS. After the upgrade, I was still able to recreate "Permission
Denied" errors using the script, but the system handled the errors
more gracefully.
I've noticed four distinct symptoms, all of which may or may not
be related: (1) complaints from rm, (2) complaints from cp,
(3) complaints from diff, and (4) complaints and/or improper
permission and link counts from ls.
1. The rm command sometimes gets:
rm: cannot remove `/mnt/bob/vfsbug.txt': No such file or directory
2. The cp command sometimes gets:
cp: cannot stat `/mnt/bob/vfsbug.txt': Permission denied
3. The diff command sometimes gets:
diff: /mnt/bob/vfsbug.txt: No such file or directory
4. After the failure, a generic ls -l command sometimes gives:
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug2.txt
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug3.txt
?--------- ? ? ? ? ? vfsbug.txt
An ls command like: ls -l /mnt/bob3/vfsbug*.txt gives:
ls: /mnt/bob3/vfsbug.txt: No such file or directory
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug2.txt
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug3.txt
I have studied problem #1 (rm is complaining) and tracked it down
to this basic sequence of events:
1. The NFS client sends RPC request V3 REMOVE to the virtual ip
to delete the file requested.
2. The virtual IP node sees the request, processes it, and
deletes the file, but before it can respond:
3. The virtual IP service is moved from that node to the other.
4. Since the NFS client never got a reply for its NFS V3 REMOVE
request, it resends it multiple times.
5. This time, the resent request is seen by the new virtual IP
owner who has never seen it before.
6. The new virtual IP owner does a gfs_dir_search for the file,
and it is now gone because the previous virtual IP owner
already processed the remove but couldn't reply to it.
7. Therefore, the new virtual IP owner returns NFS3ERR_NOENT, which,
in turn, is fed back through NFS and causes rm to give the error:
rm: cannot remove `/mnt/bob/vfsbug.txt': No such file or directory.
Here's why it doesn't fail for normal NFS clients:
Under ordinary conditions, NFS keeps a cache of requests it has
received and processed. If, for some reason, a reply packet from the
NFS server is lost, the NFS client will resend the delete request as
we see here. However, the NFS server will look in its request cache,
determine that the packet was already processed and acked, and,
assuming the ack was merely lost, it will resend the ack.
The scenario here fails because the virtual IP has moved, and the new
NFS server therefore has no idea the RPC request was already serviced;
the delete request is not in its cache of NFS requests.
The cp problem is very similar to the rm problem, but I'm still
analyzing the data (with the help of Steve Dickson.)
The diff problem is very similar too, because we're playing
"Now you see it, now you don't" with our underlying filesystem.
The problem could be a nightmare to fix. I'm not convinced it's even
possible to fix it given the circumstances and the sequence of events.
There are comments in gfs's inode.c that lead me to believe we can't
protect the operation with glocks due to deadlocks we'd introduce.
And without any kind of glock across the whole operation, there is no
way for the two clients to know what packets the other has processed.
In other words, GFS can protect the delete from the filesystem itself,
but we can't protect the underlying RPC communications.
Our best bet might be to let RPC ack the delete request BEFORE we
actually process it. That would at least tighten the window. But
that is an issue between NFS and RPC, not GFS. I'm going to keep
an open dialog with the NFS/RPC people on this.
What disturbs me most is the "ls" problem. I need to study that
problem more. Regardless of the conditions and what has happened
to that file, I believe GFS should report the file permissions,
attributes and nlink count correctly. That's what I'm going to
focus on next.
One thing seems apparent: Since upgrading to the U3 RPMS for RHCS,
I haven't seen the filesystem corruption and umount problems as before.
The problem with cp (problem 2 above) is similar to the rm case:
1. NFS Server 1's cp creates the file via RPC V3 CREATE, but can't
reply because the virtual IP is being moved.
2. NFS Server 2 gets control of the virtual IP
3. NFS client, having not seen a reply, sends a duplicate V3 CREATE
4. New Server 2 sees the duplicate V3 CREATE and tries to create the
file. However, it now sees the file server 1 already created
in step 1, so it throws the error. Therefore, cp errors out with
the permission denied error.
Problem 3 (with cp and diff) looks like this:
1. Server 1's create works (V3 CREATE).
2. Server 1's cp starts pushing data to the file (V3 WRITE).
3. While the writes are going on, the virtual IP is moved from
server 1 to 2.
4. The write requests to the virtual IP go unanswered and get resent.
5. Server 2 finally gets the duplicate write requests, then V3 COMMIT.
The V3 WRITE and COMMIT return NFS3ERR_ACCES.
It seems like problem 4 has something to do with the VFS inode cache...
I'm still investigating.
I sent another long email and more traces to Steve Dickson. This seems
related more to NFS and RPC than GFS and Cluster Suite.
*** Bug 174638 has been marked as a duplicate of this bug. ***
The first problem is due to the duplicate request cache NFS servers
maintain. The cache needs to be moved over to the new server (which
will not be that easy) or possibly be mirrored throughout the cluster.
The second problem (probably) has something to do with the fact that
the duplicated write is arriving at the new server before
the exports are set up (or moved over) correctly. If this
indeed is the case, it appears we will need some type
of fencing mechanism that will cause the new server to drop the
incoming messages until the exports are set up...
Just a note that QA is running into this issue as well. We saw the
following strange message many times on one of the new VIP servers before the
problems started: find_exported_dentry: npd != pd
I would like to close this bugzilla as a dup of bugzilla 132823 with the
following summary statement:
With the current state of Linux kernels (both RHEL and upstream), NFS failovers
have been error-prone. An example is the "non-idempotent" issues associated
with operations such as "rm" or "rename". This could start with the (nfs)
client sending in a file remove request ("rm"). The (nfs) server passes the
call into the filesystem and somehow gets stuck there for a while (say, waiting
for a directory lock). A timeout occurs on the client side and a re-transmit
happens. Current Linux NFSD is designed to handle this issue with a global
"request reply cache" where each request is checked against entries in the
cache. If a duplicate is ever found, the re-transmitted request is subsequently
dropped. However, in a failover scenario, if the back-up server gets the
re-transmitted request, and there is no equivalent entry in its own reply cache
to protect against this duplication, the client may end up getting various
error (ESTALE, EPERM, etc.) return codes, regardless of whether the request has
been carried out and succeeded on the original server.
To ease these inherent NFS protocol issues, we've tentatively identified
several work items that we plan to address in the next few RHEL updates and/or
releases via bugzilla 132823. Sorry that we have to flag the planned changes
as private events to avoid false expectations, mostly because the work items
are subject to the acceptance of the upstream Linux kernel community and
overall RHEL product directions.
Before we complete the work via bz 132823, for NFS v2/V3, RHEL 4.4 has the
following restrictions:
B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors,
IO activities on the failover ip interface must be temporarily quiesced
until the active-active failover transition completes. This is to avoid
non-idempotent NFS operation failures on the new server. (Check out
"Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in the 2006
OLS proceedings.)
B-2: With various possible base kernel bugs outside RHCS' control, there
is a possibility that a local filesystem (such as ext3) umount could
fail. To ensure data integrity, RHCS will abort the failover. The admin
can either specify the self-fence (reboot the taken-over server) option
to force failover (via the cluster.conf file) or re-mount the filesystem
on the taken-over server as ro (read-only) to allow failover. Both
options have the possibility of losing data.
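For illustration, the self-fence option mentioned in B-2 is set on the fs resource in cluster.conf. This is a hedged fragment only: the resource name, device, and mountpoint are placeholders, and the exact attribute spelling should be checked against the shipped fs resource agent.

```xml
<!-- Hypothetical fs resource; self_fence="1" reboots the taken-over
     server if the umount fails, forcing the failover through. -->
<resources>
    <fs name="export-fs" device="/dev/sdb1" mountpoint="/export"
        fstype="ext3" self_fence="1"/>
</resources>
```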
B-3: If an nfs client invokes an NLM locking call, the subject nfs servers
(both take-over and taken-over) will enter a global 90-second (tunable)
locking grace period for every nfs service on the servers.
B-4: If NFS-TCP is involved, failover should not be issued on the same pair
of machines multiple times within a 30-minute period; for example,
failing over from node A to B, then immediately failing back from B to
A, would hang the connection. This is to avoid the TCP TIME_WAIT issue.
For NFS V4 failover, the issues were archived in:
The summary can be highlighted by NFS V4 developer J. Bruce Fields' reply as
follows:
> I have been working on setting up NFSv4 in cluster scenario. But when
> it comes to an Active-Active setup, it seems there are a few problems.
Yes. Even active-passive failover is problematic right now, mainly
because the method we use for storing reboot recovery information
(analogous to the state recorded by statd) is still in flux.
We're aware of these problems and working on solving them. There are
some rough (possibly out-of-date) notes here:
but those are more of interest to someone working on design of a
solution than to someone trying to set up and use any of this (which we
don't recommend doing yet).
Until the general NFS v4 dust settles for Linux, we don't plan to address
NFS V4 failover issues in bugzilla 132823, in order to keep the problem in a
manageable (and deliverable) state.
*** This bug has been marked as a duplicate of 132823 ***