Description of problem:
NFS traffic to a GFS file system mounted with the -o sync option gets data errors if the IP address for the mount fails over to another node in the cluster. The problem does not occur if -o sync is not used.

Version-Release number of selected component (if applicable):

How reproducible:
Each time

Steps to Reproduce:
1. mount gfs file system with -o sync option
2. export over nfs
3. run io from nfs client mounted to an IP configured as a cluster service
4. fail node so that IP service fails over to another node

Actual results:
Data errors

Expected results:
No data errors

Additional info:
Using dt to generate IO: created two NFS volumes, one sync and one async. Started dt to both volumes and left grid (node2) on the 2-node cluster. IO to the sync volume was stopped. The following error messages were contained in the dt log:

dt (18333): 'write', errno = 5 - Input/output error
dt (18333): Relative block number where the error occurred is 1804668
dt (18333): Error number 1 occurred on Tue Nov 29 16:11:27 2005
dt (18333): 'fsync', errno = 5 - Input/output error
dt (18333): Error number 2 occurred on Tue Nov 29 16:11:27 2005

Total Statistics (18333):
    Output device/file name: /mnt/smk_01/dtfile-18333 (device type=regular)
    Type of I/O's performed: sequential (forward)
    Current Process Reported: 1/2
    Data pattern read/written: 0x39c39c39
    Total records processed: 1804668 @ 512 bytes/record (0.500 Kbytes)
    Total bytes transferred: 923990016 (902334.000 Kbytes, 881.186 Mbytes)
    Average transfer rates: 822553 bytes/sec, 803.274 Kbytes/sec
    Number I/O's per second: 1606.548
    Total passes completed: 0
    Total errors detected: 2/1
    Total elapsed time: 18m43.32s
    Total system time: 00m07.72s
    Total user time: 00m06.08s
    Starting time: Tue Nov 29 15:52:44 2005
    Ending time: Tue Nov 29 16:11:27 2005

dt (18333): 'unlink', errno = 5 - Input/output error
dt (18333): Error number 3 occurred on Tue Nov 29 16:11:27 2005
dt (18334): 'write', errno = 5 - Input/output error
dt (18334): Relative block number where the error occurred is 1969501
dt (18334): Error number 1 occurred on Tue Nov 29 16:11:27 2005
dt (18334): 'fsync', errno = 5 - Input/output error
dt (18334): Error number 2 occurred on Tue Nov 29 16:11:27 2005

Total Statistics (18334):
    Output device/file name: /mnt/smk_01/dtfile-18334 (device type=regular)
    Type of I/O's performed: sequential (forward)
    Current Process Reported: 2/2
    Data pattern read/written: 0x00ff00ff
    Total records processed: 1969501 @ 512 bytes/record (0.500 Kbytes)
    Total bytes transferred: 1008384512 (984750.500 Kbytes, 961.670 Mbytes)
    Average transfer rates: 897203 bytes/sec, 876.175 Kbytes/sec
    Number I/O's per second: 1752.350
    Total passes completed: 0
    Total errors detected: 2/1
    Total elapsed time: 18m43.92s
    Total system time: 00m07.76s
    Total user time: 00m06.57s
    Starting time: Tue Nov 29 15:52:44 2005
    Ending time: Tue Nov 29 16:11:27 2005

dt (18334): 'unlink', errno = 5 - Input/output error
dt (18334): Error number 3 occurred on Tue Nov 29 16:11:27 2005
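The four reproduction steps above might be sketched roughly as follows. The device path, export path, mount points, and the floating service IP (10.0.0.100) are assumptions, not taken from this report; with DRY_RUN left at 1 the script only prints the commands, so it is safe to run outside a cluster:

```shell
#!/bin/sh
# Sketch of steps 1-4 above. Device, paths, and service IP are hypothetical.
DRY_RUN=${DRY_RUN:-1}   # leave at 1 unless running on a real cluster
PLAN=""

run() {
    PLAN="$PLAN $*;"
    if [ "$DRY_RUN" = "1" ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Mount the GFS file system with -o sync on the serving node
run mount -t gfs -o sync /dev/pool/gfs01 /mnt/smk_01

# 2. Export it over NFS
run exportfs -o rw,sync "*:/mnt/smk_01"

# 3. From the NFS client, mount via the cluster service IP and drive I/O
run mount -t nfs 10.0.0.100:/mnt/smk_01 /mnt/client
run dt of=/mnt/client/dtfile bs=512 passes=1 limit=1G

# 4. Fail the serving node so the IP service moves to the other member,
#    e.g. by cutting its power through the fence device.
```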
There was an O_SYNC bug that Kevin fixed earlier. If you can't reproduce this with the latest code, you might try pulling that fix out. It's possible that this is another side effect of that bug.
I'd like to get some more information that will help me recreate this scenario in our lab. I was unable to recreate this problem using a three-node cluster, i686 hardware, and commands similar to this from my nfs client:

dt of=/mnt/bob/test.dt bs=512 passes=1 limit=1G log=/var/log/bob.dt.log

In every test I tried, my nfs server failed over to the next server successfully, and no errors were logged by either the dt tool or the systems in the cluster. In each case, dt kept writing happily and my file size kept growing. When my primary nfs server came back, it took over nfs serving again, also without any errors reported by dt. By the way, my cluster is using the latest cluster suite rpms, including GFS-kernel-2.6.9-47.1.i686.rpm.

I would like the following information:

1. A list of all RPMs on the server nodes and client nodes (i.e. the output from rpm -qa). Please put it in an attachment; one copy is okay if the nodes are all basically the same. I'm especially interested in rpm -q GFS-kernel, but I'd like to see the rest too.
2. The exact dt command used to recreate the problem.
3. A brief description of the hardware involved with this problem (for example: Pentium 4 CPU 2.40GHz, 512MB RAM, Brocade fencing, etc.).
4. A brief description of the environment in which the problem occurred. For example: is dt writing a new file, or is it overwriting an existing file? Is it copying from the -o sync mount to the non-sync mount, or from the non-sync mount to the -o sync mount? Are there multiple dt processes running on multiple files? Is the nfs server (or client) under a heavy workload when the failure occurs? And so on.
5. A description of how the primary nfs server node was brought down to create the failure. Did you use the /sbin/reboot command, and with what parameters? Did you pull the plug? Did you get a kernel panic? Did you tell your power switch to cut the power? Did you do a really long kernel-only operation like insmod?
6. Anything else you think might help me recreate the problem in our lab.

Thanks.

Bob Peterson
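Items 1 and 3 of the request above could be gathered on each node with a small script along these lines. The output directory and file names here are my own invention, not part of the original request:

```shell
#!/bin/sh
# Hypothetical helper to collect the RPM list and a brief hardware summary.
# The directory /tmp/nfs-gfs-info and file names are assumptions.
OUT=${OUT:-/tmp/nfs-gfs-info}
mkdir -p "$OUT"

# 1. Full RPM list, plus the GFS kernel package specifically
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | sort > "$OUT/rpm-qa.txt"
    rpm -q GFS-kernel > "$OUT/gfs-kernel.txt" 2>&1
fi

# 3. Brief hardware description: CPU model and total memory
{
    grep -m1 'model name' /proc/cpuinfo
    grep MemTotal /proc/meminfo
} > "$OUT/hardware.txt" 2>/dev/null

echo "Attach the files in $OUT to the bug report."
```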
I'm fairly certain this bug is the same as bz 178057. The use of the -o sync option probably just changes the timing slightly but does not eliminate the problem. The IO errors reported by the dt tool are probably the result of the NFS3ERR_ACCES described in comment 20 of bz 178057. *** This bug has been marked as a duplicate of 178057 ***