Created attachment 123308 [details] Script to run on NFS client to generate I/O
Description of problem:
With an NFS client mounting a GFS file system over a virtual IP managed by the cluster manager, moving the virtual IP between nodes while running I/O to the NFS mount can cause permission denied errors on the NFS client.

Version-Release number of selected component (if applicable):

How reproducible:
Frequently, but not every time

Steps to Reproduce:
1. Create a GFS cluster with an IP service managed by the cluster manager
2. Mount a GFS file system from an NFS client using that IP
3. Run I/O on the NFS client against the NFS-mounted file system
4. Use clusvcadm to move the IP service between nodes repeatedly

Actual results:
Permission denied errors on the NFS client

Expected results:
No I/O errors on the NFS client

Additional info:
The NFS mount uses default parameters, i.e. mount -t nfs 10.250.5.21:/export/hhtest /hhtest

The attached scripts were used to generate I/O and to move the virtual IP. In the test3 script, commenting out the rm of the file improves the behavior, but eventually the permission denied errors still occur. Running these scripts repeatedly causes various symptoms. We have seen cases where the file exists on the GFS node but not on the NFS client. If you try to remove the file on the NFS client, it is not there. If you then try to copy the file from a local file system to the NFS client, you get permission denied errors. The results are often indeterminate.
Created attachment 123309 [details] Script to run on GFS node to move virtual IP between nodes
Created attachment 123310 [details] Text file to copy from local file system to NFS mounted file system
Created attachment 123313 [details] Cluster.conf used for GFS cluster

These four attachments are the actual files used to create the test scenario in which these problems were observed. The NFS mount used all default options, i.e. mount -t nfs 10.250.5.21:/export/hhtest /hhtest.
Created attachment 123402 [details] ls -l output for node 1
Created attachment 123403 [details] ls -l output for node 2
Repeated the test with the VIP moving between nodes every 30 seconds. The test ran for over 15 minutes and completed 626 iterations before failing with a permission denied error on the client. The ls -l output on node 1 and node 2 differed (see the attachments in comments 5 and 6). The VIP had moved to node 2 at the time of failure. With the NFS client still mounted through the VIP to node 2, the test file (vfsbug.txt) does not show up in ls -l from the client. One other observation: at the time of failure, an ethereal trace showed RPC retransmits occurring for a period of about 30 seconds. These started on the last of 5 NFS writes, which eventually completed with an access error. This trace is available on request.
I've recreated the environment now, but not the problem yet. I've run the scenario for about an hour now without failure. My test systems probably have a newer (unreleased) version of the cluster code, though. In the past, I have noticed very similar strange attributes reported by ls when I was running a slightly older version of the cluster suite code. This might indicate that the problem was recently fixed, but that's not guaranteed. In the meantime, I'll continue to run the provided scripts overnight to try to recreate the problem. If the test runs overnight, I will start backing out code fixes to see if the problem reappears. Of course, if the problem recreates, I will debug it. Also, I'd like to know what version of the cluster rpms and kernel the failing systems are running.
My tests ran overnight without failure. That is approximately 16 hours of running the scripts, and more than 2.7 million successful diffs, while moving the ip service every ten seconds. My cluster.conf and scripts were identical to the ones attached to this bz except for the following:

(1) I am using a three-node cluster instead of two, but I am only switching the ip service between two of the nodes.
(2) My cluster.conf specifies my cluster id, node names, IP addresses and fencing specifics.
(3) My cluster.conf does not specify the additional dev2-mgmt, snapshot, or email_notifier services.

Obviously, my hardware is different too, so there are potential timing issues.
We are running the 2.6.9-22 kernel with the following rpms:

ccs-1.0.2-0.x86_64.rpm
ccs-debuginfo-1.0.2-0.x86_64.rpm
ccs-devel-1.0.2-0.x86_64.rpm
cman-1.0.4-0.x86_64.rpm
cman-debuginfo-1.0.4-0.x86_64.rpm
cman-devel-1.0.4-0.x86_64.rpm
cman-kernheaders-2.6.9-41.0.x86_64.rpm
dlm-1.0.0-5.x86_64.rpm
dlm-debuginfo-1.0.0-5.x86_64.rpm
dlm-devel-1.0.0-5.x86_64.rpm
dlm-kernheaders-2.6.9-39.1.x86_64.rpm
dmraid-1.0.0.rc8-1_RHEL4_U2.x86_64.rpm
dmraid-debuginfo-1.0.0.rc8-1_RHEL4_U2.x86_64.rpm
fence-1.32.10-0.x86_64.rpm
fence-debuginfo-1.32.10-0.x86_64.rpm
GFS-6.1.3-0.x86_64.rpm
GFS-debuginfo-6.1.3-0.x86_64.rpm
GFS-kernheaders-2.6.9-45.0.x86_64.rpm
gnbd-1.0.2-0.x86_64.rpm
gnbd-debuginfo-1.0.2-0.x86_64.rpm
gnbd-kernheaders-2.6.9-9.14.x86_64.rpm
gulm-1.0.4-0.x86_64.rpm
gulm-debuginfo-1.0.4-0.x86_64.rpm
gulm-devel-1.0.4-0.x86_64.rpm
iddev-2.0.0-3.x86_64.rpm
iddev-debuginfo-2.0.0-3.x86_64.rpm
iddev-devel-2.0.0-3.x86_64.rpm
ipvsadm-1.24-6.x86_64.rpm
ipvsadm-debuginfo-1.24-6.x86_64.rpm
lvm2-cluster-2.01.14-1.0.RHEL4.x86_64.rpm
lvm2-cluster-debuginfo-2.01.14-1.0.RHEL4.x86_64.rpm
magma-1.0.3-2.x86_64.rpm
magma-debuginfo-1.0.3-2.x86_64.rpm
magma-devel-1.0.3-2.x86_64.rpm
magma-plugins-1.0.5-0.x86_64.rpm
magma-plugins-debuginfo-1.0.5-0.x86_64.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
piranha-0.8.1-1.x86_64.rpm
piranha-debuginfo-0.8.1-1.x86_64.rpm
rgmanager-1.9.43-0.x86_64.rpm
rgmanager-debuginfo-1.9.43-0.x86_64.rpm
system-config-cluster-1.0.16-1.0.noarch.rpm
I reverted a bunch of the critical gfs code on my cluster back to gfs-kernel-2.6.9-45 based on comment #10 above. Tests ran for an hour and a half without failing. I looked at all the gfs, cman and rgmanager code changes made since gfs-kernel-2.6.9-45 but didn't see anything else that I thought would affect it except for the gfs changes I had previously reverted. Monday, I am going to start over with a different cluster (2-node) running the 2.6.9-22.0.1.EL kernel, and I'll start trying to reproduce the problem from there. If that fails, I'll begin upgrading it and see if the problem goes away with more recent code fixes.
I now have a two-node cluster with the failing system's RPMs as listed above. I ran the scenario using the customer's scripts and data. This scenario ran overnight, some 20-odd hours, with three and a half million diffs, all without failure. The only difference: I'm using 32-bit i686 machines, not 64-bit x86_64 machines. I'm beginning to think this problem is either (a) something specific to the x86_64 architecture, like perhaps a compiler bug, or (b) extremely timing-dependent. My i686 test machines are quite a bit slower. I'm trying to find two x86_64 machines I can set up in a cluster to run the same scenario.
This bug was originally seen as a slightly different problem; the test scripts were run to try to understand better what was happening. I now have a network trace for the original problem, which is this: with I/O running over NFS to a GFS file system on one node, we can fail the node and cause it to get fenced, and when the node reboots and the virtual IP fails back to the original node, we get a permission denied error. The network trace shows that this is due to an AUTH_ERROR at the RPC layer. In the attached network trace (in the following comment), taken at the original node after it has been fenced and rebooted, packet number 106 shows the ARP when the VIP (10.251.253.14) moves back to the original node. In packet number 133, the WRITE REPLY shows the AUTH_ERROR at the RPC layer. I believe that this RPC AUTH_ERROR is the same error we have seen running the test scripts previously attached. We have noticed that this problem occurs rather consistently on some days, and on other days it does not occur at all. We are suspicious that it may be a network problem.
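For reference, a comparable capture can be taken on the server without ethereal's GUI; this is just a generic tcpdump sketch (the interface name and VIP address below are taken from this report and may differ on other setups):

  # Capture NFS/RPC traffic to and from the VIP, plus ARP, on the server's
  # public interface, writing a pcap file that ethereal can open later.
  # eth1 and 10.251.253.14 are this report's values; adjust as needed.
  tcpdump -i eth1 -s 0 -w /tmp/nfs-failover.pcap 'host 10.251.253.14 or arp'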
Had problems trying to attach the ethereal trace file. Please let me know if you want to see it, and I'll try to email it to you.
There are apparently two failing scenarios: (1) the "original problem," which dealt with a server that went down, was fenced, and came back, and (2) Henry Harris's failing scenario, which he can recreate and I can't. For scenario (1) we have the ethereal trace. For (2) we have the provided scripts. I still haven't been able to recreate the failure using my machines. The two scenarios might be the same problem; that's an unproven theory at this point.

Regarding (1): I got the ethereal trace via email and analyzed it, comparing it to my own ethereal traces of my non-failing attempts to recreate. I also sent it to an NFS guy, Steve Dickson, to get his input. As indicated in comment #13, the trace shows an RPC authorization error, AUTH_ERROR. Steve's theory (paraphrased by me) is that function svcauth_unix_set_client() is failing when called from svc_process() via progp->pg_authenticate(). pg_authenticate() calls svcauth_unix_set_client(), which looks in the cache to see if the given client IP address exists there. If it does not, which is probably the case here, it returns SVC_DENIED, which is turned into rpc_autherr_badcred, which is what we see in the ethereal trace. This involves the kernel code in net/sunrpc/auth_unix.c, net/sunrpc/svc.c and other kernel code under net/sunrpc/. The area of kernel code dealing with RPC's authentication of unix credentials has been changed extensively since the RHEL4 version the customer is using. It looks like many of the changes have been ported into recent FC4 kernels but not into the RHEL 2.6.9-27 kernel. It's hard to say whether porting the changes would make the problem better or worse, but it would sure be interesting to know. If I could recreate the failure, I could experiment with this and figure it out. I'm in the process of checking newer RHEL4 kernels to see if the rpc code is there.

As for IP service relocation: after you cut through the complexities of rgmanager and how it does its job, what really happens on the servers is essentially this:

/sbin/ip -f inet addr del dev eth1 <virtual ip>   (on the old server)
/sbin/ip -f inet addr add dev eth1 <virtual ip>   (on the new server)

So perhaps the old server's credentials are hanging around in the rpc cache. If Steve's theory holds true, the problem might be solvable by somehow getting rgmanager to flush the credential authentication cache after it does the "ip addr del" on the old server and before doing the add on the new server when relocating the service (a rough sketch of what that sequence might look like follows the questions below). As Steve pointed out, that's not necessarily an easy thing to do. Since the kernel rpc authentication code has been modified recently, it's worth trying to figure out whether the changes fix the customer problem (or make it worse).

Regarding (2): I have been assigned two x86_64 machines (Intel, not Opteron) here in the lab with shared storage. I just need to put the appropriate RPMs on them and set them up in a cluster. Hopefully, that will allow me to recreate the failing scenario in the lab. I probably won't get this done until Monday. I'm also going to investigate whether the /sbin/ip tool does anything regarding RPC and its authentication cache (or whether it's supposed to).

I also question whether this bz has anything to do with bz 137057 or bz 166458. I'm going to look at the evidence for those bugs and see if there is any correlation. Solving this may also solve the mystery of my previous bz 174638.

I'd like to know a few things if possible:
1. I'd like to see the /etc/exports file to see what parameters are being given for the exports in question.
2. I'd like to know whether the file I/O is being done as root or as some other user.
3. I'd like to know if the two NFS servers have pretty close or pretty distant time stamps. In other words, are they time-synched at all? (I noticed that some of the rpc cache cleanup code checks time values, and keeping the timestamps close or distant may affect whether the problem occurs.) Running a time-changing command like date or rdate -s might affect the problem.
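To make the flush idea above concrete, here is a rough, hypothetical sketch of the relocation sequence with an authentication-cache flush added. This is not what rgmanager does today; exportfs -f and the sunrpc proc flush file are assumptions about suitable flush mechanisms, and the /24 prefix on the VIP is also assumed:

  VIP="10.250.5.21/24"   # this report's virtual IP; prefix length assumed

  # --- on the old server ---
  /sbin/ip -f inet addr del dev eth1 "$VIP"
  exportfs -f                                            # flush the kernel export caches
  echo "$(date +%s)" > /proc/net/rpc/auth.unix.ip/flush  # expire cached client credentials

  # --- on the new server ---
  exportfs -f                                            # start from a clean cache
  /sbin/ip -f inet addr add dev eth1 "$VIP"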
Now I've got a two-node cluster consisting of two x86_64 (Intel, not Opteron) nodes with the same RPMs and kernel as documented by the customer above. I ran their service_relocate.sh script overnight, relocating the virtual IP every ten seconds. At the same time, I ran the nfs_client_io.sh script on both an i686 NFS client (running RHEL4) and an x86_64 NFS client (running FC4). The i686 client did 3.3 million diffs successfully; the x86_64 client did 2.5 million diffs successfully.

While all this was going on, I made sure the GFS filesystem had lots of I/O happening. One of the NFS servers was continually untarring a big (235MB) tarball of directory /usr/lib64:

for i in `seq 1 1000` ; do echo $i ; tar -xzf lib64.tgz ; done

(It did this successfully 442 times.) At the same time, the other node (the node also running the service_relocate script) was kept busy doing continuous recursive filesystem ls commands:

for ((a=1; a>0; a++)) ; do echo $a ; ls -alR | wc -l ; done

(It did this successfully more than 12000 times. FYI: the wc reported between 20129 and 20131 lines of ls output, depending on what the other tar process was doing.)

This was all done in the hope of reproducing either this bug or other GFS corruption bugs such as bz 175589, bz 176352 or even bz 179069. Unfortunately for this bug (but fortunately for GFS in general), all of this ran successfully with absolutely no evidence of filesystem corruption. I'm planning on mixing things up a bit and trying more I/O variations.
I *may* have been able to recreate the failure; I created *a* failure, at least. I was running a variant of the script that rm'd and cp'd the file before running the diff. My script now reports only every 100 diffs, thus increasing the pressure on the filesystem. Running the script almost immediately gave me an "Access Denied" error, although it might not be the same one the customer saw. The NFS client got:

[root@roth-03 /]# /tmp/nfs_client_io6.sh
diff: /mnt/bob/vfsbug.txt: Permission denied
Diff failed 1

The NFS client started getting (in /var/log/messages) lots of:

Feb 1 08:06:24 roth-03 kernel: NFS: Buggy server - nlink == 0!
Feb 1 08:06:24 roth-03 kernel: __nfs_fhget: iget failed

dmesg showed:

nfs_refresh_inode: inode number mismatch
expected (0xa/0x361dbd), got (0xa/0x30faff)

Upon closer examination, the file got copied, but my NFS server node (the one that had the virtual IP at the moment) showed the nlink count as 0! I saw:

[root@roth-02 /]# ls -l /mnt/bob3
(snip)
-rw-r--r-- 0 root root 155806 Feb 1 08:00 vfsbug.txt

But the other node (which didn't have the VIP at the time) said:

-rw-r--r-- 1 root root 155806 Feb 1 08:00 vfsbug.txt

The node with the VIP wouldn't let me umount the GFS mount, despite my having exportfs -u'd everything and gotten out of all directories. (No users of the mount point were found with lsof or fuser.) I was finally forced to reboot the node, at which point it kernel panicked, exhibiting the (known) lock_dlm problem. I power-cycled the virtual IP node via the power switch. When it came back, after remounting the NFS share on the NFS client, an ls -l showed 0 files, but on both servers all the files appeared just fine! I decided to gfs_fsck the fs. The filesystem was fixed just fine (sorry, no bz 179069) and I continued to experiment.

Next, I tried to recreate the failure by just doing the diffs, with no rm/cp on the NFS client, and it didn't recreate for the short time I ran it. Then I tried to recreate the failure by running the failing script (with the rm/cp) on my i686 NFS client against two i686 servers with the failing RPMs. Again, it did not recreate for the short time I ran it. Next, I tried to recreate the failure by running the failing scenario from my i686 client against the x86_64 servers where I saw the failure earlier, WITH the rm/cp commands in the script.

So in short: this didn't happen on my slower i686 machines. It only happened when using faster x86_64 machines for both client and servers. Incidentally, I am using:

mount -o timeo=10,retrans=10

in order to avoid dozens of dmesgs that say:

nfs: server 10.15.84.248 not responding, still trying
nfs: server 10.15.84.248 OK

I was using those settings when the failure occurred. I was able to recreate the same error a second time using the same scenario, and that is definitely good news. My plans are to: (1) figure out if this is a known/fixed problem by upgrading my systems with newer rpms, and (2) backtrack the problem to its source to figure out whether this is a problem with gfs or nfs (or even vfs).
This looks like the same failure as reported. We got the permission denied error with the server that had the VIP at the time of failure showing nlink count=0. Usually it also showed all ?'s for file owner, size, permissions, etc.
I updated my x86_64 test systems with the latest RHEL4 U3 RPMs for RHCS. After the upgrade, I was still able to recreate "Permission Denied" errors using the script, but the system handled the errors much better. I've noticed four distinct symptoms, all of which may or may not be related: (1) complaints from rm, (2) complaints from cp, (3) complaints from diff, and (4) complaints and/or improper permission and link counts from ls.

1. The rm command sometimes gets:
rm: cannot remove `/mnt/bob/vfsbug.txt': No such file or directory

2. The cp command sometimes gets:
cp: cannot stat `/mnt/bob/vfsbug.txt': Permission denied

3. The diff command sometimes gets:
diff: /mnt/bob/vfsbug.txt: No such file or directory

4. After the failure, a generic ls -l command sometimes gives:
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug2.txt
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug3.txt
?--------- ? ? ? ? ? vfsbug.txt

and an ls command like "ls -l /mnt/bob3/vfsbug*.txt" gives:
ls: /mnt/bob3/vfsbug.txt: No such file or directory
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug2.txt
-rw-r--r-- 1 root root 155806 Jan 31 06:11 /mnt/bob3/vfsbug3.txt

I have studied problem #1 (rm complaining) and tracked it down to this basic sequence of events:

1. The NFS client sends an RPC V3 REMOVE request to the virtual IP to delete the requested file.
2. The virtual IP node sees the request, processes it, and deletes the file, but before it can respond:
3. The virtual IP service is moved from that node to the other.
4. Since the NFS client never got a reply to its NFS V3 REMOVE request, it resends it multiple times.
5. This time, the resent request is seen by the new virtual IP owner, who has never seen it before.
6. The new virtual IP owner does a gfs_dir_search for the file, and it is now gone because the previous virtual IP owner already processed the remove but couldn't reply to it.
7. Therefore, the new virtual IP owner returns NFS3ERR_NOENT, which, in turn, is fed back through NFS and causes rm to give the error: rm: cannot remove `/mnt/bob/vfsbug.txt': No such file or directory.

Here's why this doesn't fail in a normal (single-server) NFS setup: under ordinary conditions, NFS keeps a cache of requests it has received and processed. If, for some reason, a reply packet from the NFS server is lost, the NFS client resends the delete request, as we see here. However, the NFS server will look in its request cache, determine that the packet was already processed and acked, and, assuming the ack was merely lost, resend the ack. The scenario here fails because the virtual IP has moved, and therefore the new server has no idea the RPC request was already serviced; the delete request is not in its cache of NFS requests.

The cp problem is very similar to the rm problem, but I'm still analyzing the data (with the help of Steve Dickson). The diff problem is very similar too, because we're playing "now you see it, now you don't" with the underlying filesystem.

The problem could be a nightmare to fix. I'm not convinced it's even possible to fix given the circumstances and the sequence of events. There are comments in gfs's inode.c that lead me to believe we can't protect the operation with glocks, due to deadlocks we'd introduce. And without any kind of glock across the whole operation, there is no way for the two server nodes to know what packets the other has processed. In other words, GFS can protect the delete at the filesystem level, but we can't protect the underlying RPC communications.
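If anyone wants to confirm that retransmitted, uncached requests are involved, a rough way to watch for it during a relocation (generic commands, not tied to this cluster's tooling):

  # On the NFS client: watch the RPC retransmission counter climb while
  # the VIP moves between servers.
  nfsstat -rc

  # On each NFS server: the first line of the nfsd stats shows the reply
  # (duplicate request) cache counters as "rc <hits> <misses> <nocache>".
  # Retransmitted requests arriving at the new VIP owner should show up
  # as misses rather than hits.
  head -1 /proc/net/rpc/nfsd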
Our best bet might be to let RPC ack the delete request BEFORE we actually process it. That would at least tighten the window. But that is an issue between NFS and RPC, not GFS. I'm going to keep an open dialog with the NFS/RPC people on this. What disturbs me most is the "ls" problem; I need to study that problem more. Regardless of the conditions and what has happened to that file, I believe GFS should report the file permissions, attributes and nlink count correctly. That's what I'm going to focus on. One thing seems apparent: since upgrading to the U3 RPMs for RHCS, I haven't seen the filesystem corruption and umount problems I saw before.
The problem with cp (problem 2 above) is similar to the rm case:

1. Server 1 creates the file for the client's cp via an RPC V3 CREATE, but it can't reply because the virtual IP is being moved.
2. Server 2 gets control of the virtual IP.
3. The NFS client, having not seen a reply, sends a duplicate V3 CREATE request.
4. Server 2 sees the duplicate V3 CREATE and tries to create the file. However, it now sees the file server 1 already created in step 1, so it throws an error. Therefore, cp errors out with Permission Denied.

Problem 3 (with cp and diff) looks like this:

1. Server 1's create works (V3 CREATE).
2. The client's cp starts pushing data to the file on server 1 (V3 WRITE).
3. While the writes are going on, the virtual IP is moved from server 1 to server 2.
4. The write requests to the virtual IP go unanswered and get resent.
5. Server 2 finally gets the duplicate write requests, then a V3 COMMIT. The V3 WRITE and COMMIT return NFS3ERR_ACCES.

It seems like problem 4 has something to do with the VFS inode cache... I'm still investigating. I sent another long email and more traces to Steve Dickson. This seems related more to NFS and RPC than to GFS and Cluster Suite.
*** Bug 174638 has been marked as a duplicate of this bug. ***
The first problem is due to the duplicate request cache NFS servers maintain. The cache needs to be moved over to the new server (which will not be that easy) or possibly be mirrored throughout the cluster?? The second problem (probably) has something to do with the fact that the duplicated write is arriving at the new server before the exports are set up (or moved over) correctly. If this is indeed the case, it appears we will need some type of fencing mechanism that will cause the new server to drop incoming messages until the exports are set up...
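Purely as an illustration of that last idea (not a proposed fix, and not anything rgmanager implements today), one crude way to hold off client traffic on the take-over node until the exports are in place might look like this:

  # Hypothetical sketch for the take-over node; 2049 is the NFS port.
  iptables -I INPUT -p tcp --dport 2049 -j DROP     # hold off incoming NFS traffic
  iptables -I INPUT -p udp --dport 2049 -j DROP

  /sbin/ip -f inet addr add dev eth1 <virtual ip>   # take over the VIP
  exportfs -r                                       # make sure the exports are set up

  iptables -D INPUT -p tcp --dport 2049 -j DROP     # now let the retransmitted
  iptables -D INPUT -p udp --dport 2049 -j DROP     # client requests through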
Just a note that QA is running into this issue as well. We saw the following strange message many times on one of the new VIP servers before the problems started:

find_exported_dentry: npd != pd
I would like to close this bugzilla as a dup of bugzilla 132823, with the following summary statement:

With the current state of linux kernels (both RHEL and upstream), NFS failovers have been error-prone. An example is the "non-idempotent" issues associated with operations such as "rm" or "rename". This could start with the (nfs) client sending in a file remove request ("rm"). The (nfs) server passes the call into the filesystem and somehow gets stuck there for a while (say, waiting for a directory lock). A timeout occurs on the client side and a re-transmit happens. Current linux NFSD is designed to handle this issue with a global "request reply cache", where each request is checked against entries in the cache. If a duplicate is ever found, the re-transmitted request is subsequently dropped. However, in a failover scenario, if the back-up server gets the re-transmitted request, and there is no equivalent entry in its own reply cache to protect against this duplication, the client may end up getting various error return codes (ESTALE, EPERM, etc.), even though the request has been carried out and succeeded on the original server.

To ease these inherent NFS protocol issues, we've tentatively identified several work items that we plan to address in the next few RHEL updates and/or releases via bugzilla 132823. Sorry that we have to flag the planned changes as private events to avoid false expectations, mostly because the work items are subject to the acceptance of the upstream linux kernel community and RHEL overall product directions.

Before we complete the work via bz 132823, for NFS V2/V3, RHEL 4.4 has the following restrictions:

B-1: Unless NFS client applications can tolerate ESTALE and/or EPERM errors, IO activities on the failover ip interface must be temporarily quiesced until the active-active failover transition completes. This is to avoid non-idempotent NFS operation failures on the new server. (Check out "Why NFS Sucks" by Olaf Kirch, placed as "kirch-reprint.pdf" in the 2006 OLS proceedings.)

B-2: With various possible base kernel bugs outside RHCS' control, there are possibilities that a local filesystem (such as ext3) umount could fail. To ensure data integrity, RHCS will abort the failover. The admin could either specify the self-fence (reboot the taken-over server) option to force failover (via the cluster.conf file) or re-mount the filesystem on the taken-over server as ro (read-only) to allow failover. Both options have the possibility of losing data.

B-3: If an nfs client invokes an NLM locking call, the subject nfs servers (both the take-over and taken-over servers) will enter a global 90-second (tunable) locking grace period for every nfs service on the servers.

B-4: If NFS-TCP is involved, failover should not be issued on the same pair of machines multiple times within a 30-minute period; for example, failing over from node A to B, then immediately failing back from B to A, would hang the connection. This is to avoid the TCP TIME_WAIT issue.

For NFS V4 failover, the issues were archived in:
http://sourceforge.net/mailarchive/forum.php?thread_id=26040366&forum_id=4930
and
http://sourceforge.net/mailarchive/forum.php?thread_id=26040369&forum_id=4930

The summary can be highlighted by NFS V4 developer J. Bruce Fields' reply, as follows:

<start quote>
> I have been working on setting up NFSv4 in cluster scenario. But when
> it comes to an Active-Active setup, it seems there are a few problems.

Yes.
Even active-passive failover is problematic right now, mainly because the method we use for storing reboot recovery information (analagous to the state recorded by statd) is still in flux. We're aware of these problems and working on solving them. There are some rough (possibly out of date notes) here:

http://wiki.linux-nfs.org/index.php/Cluster_Coherent_NFS_design

but those are more of interest to someone working on design of a solution than to someone trying to set up and use any of this (which we don't recommend doing yet).
<end quote>

Until the general NFS V4 dust settles for linux, we don't plan to address NFS V4 failover issues in bugzilla 132823, in order to keep the problem in a manageable (and deliverable) state.

*** This bug has been marked as a duplicate of 132823 ***