Bug 174638

Summary: GFS file system mounted with -o sync and exported over NFS gets data errors on failover
Product: [Retired] Red Hat Cluster Suite
Reporter: Henry Harris <henry.harris>
Component: gfs
Assignee: Robert Peterson <rpeterso>
Status: CLOSED DUPLICATE
QA Contact: GFS Bugs <gfs-bugs>
Severity: medium
Priority: medium
Version: 4
CC: rkenna
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-02-14 17:20:28 UTC

Description Henry Harris 2005-11-30 23:24:09 UTC
Description of problem: NFS traffic to a GFS file system mounted with the
-o sync option gets data errors if the IP address for the mount fails over
to another node in the cluster.  The problem does not occur if -o sync is
not used.


Version-Release number of selected component (if applicable):


How reproducible:
Each time

Steps to Reproduce:
1. Mount the GFS file system with the -o sync option.
2. Export the file system over NFS.
3. Run I/O from an NFS client mounted against an IP address configured as a
cluster service.
4. Fail the node so that the IP service fails over to another node in the
cluster.
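The steps above could be sketched roughly as follows (a minimal sketch only; the device path, mount point, client subnet, and service IP are illustrative assumptions, not values taken from this report):

```shell
# On the GFS server node: mount with synchronous writes
# (device and mount point are assumed for illustration)
mount -t gfs -o sync /dev/vg0/gfslv /mnt/smk_01

# Export the mount over NFS (assumed client subnet)
echo "/mnt/smk_01 192.168.1.0/24(rw,sync)" >> /etc/exports
exportfs -ra

# On the NFS client: mount against the cluster service IP
# (192.168.1.100 is an assumed service address)
mount -t nfs 192.168.1.100:/mnt/smk_01 /mnt/nfs

# Drive I/O, then fail the serving node so the IP relocates
dt of=/mnt/nfs/test.dt bs=512 passes=1 limit=1G
```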
  
Actual results:
Data errors

Expected results:
No data errors


Additional info:
Using dt to generate I/O:

Created two NFS volumes, one sync and one async.  Started dt to both volumes
and had node2 leave the grid on the two-node cluster.  I/O to the sync volume
was stopped.  The following error messages were contained in the dt log.

dt (18333): 'write', errno = 5 - Input/output error
dt (18333): Relative block number where the error occcured is 1804668

dt (18333): Error number 1 occurred on Tue Nov 29 16:11:27 2005
dt (18333): 'fsync', errno = 5 - Input/output error
dt (18333): Error number 2 occurred on Tue Nov 29 16:11:27 2005

Total Statistics (18333):
     Output device/file name: /mnt/smk_01/dtfile-18333 (device type=regular)
     Type of I/O's performed: sequential (forward)
    Current Process Reported: 1/2
   Data pattern read/written: 0x39c39c39
     Total records processed: 1804668 @ 512 bytes/record (0.500 Kbytes)
     Total bytes transferred: 923990016 (902334.000 Kbytes, 881.186 Mbytes)
      Average transfer rates: 822553 bytes/sec, 803.274 Kbytes/sec
     Number I/O's per second: 1606.548
      Total passes completed: 0
       Total errors detected: 2/1
          Total elapsed time: 18m43.32s
           Total system time: 00m07.72s
             Total user time: 00m06.08s
               Starting time: Tue Nov 29 15:52:44 2005
                 Ending time: Tue Nov 29 16:11:27 2005

dt (18333): 'unlink', errno = 5 - Input/output error
dt (18333): Error number 3 occurred on Tue Nov 29 16:11:27 2005
dt (18334): 'write', errno = 5 - Input/output error
dt (18334): Relative block number where the error occcured is 1969501

dt (18334): Error number 1 occurred on Tue Nov 29 16:11:27 2005
dt (18334): 'fsync', errno = 5 - Input/output error
dt (18334): Error number 2 occurred on Tue Nov 29 16:11:27 2005

Total Statistics (18334):
     Output device/file name: /mnt/smk_01/dtfile-18334 (device type=regular)
     Type of I/O's performed: sequential (forward)
    Current Process Reported: 2/2
   Data pattern read/written: 0x00ff00ff
     Total records processed: 1969501 @ 512 bytes/record (0.500 Kbytes)
     Total bytes transferred: 1008384512 (984750.500 Kbytes, 961.670 Mbytes)
      Average transfer rates: 897203 bytes/sec, 876.175 Kbytes/sec
     Number I/O's per second: 1752.350
      Total passes completed: 0
       Total errors detected: 2/1
          Total elapsed time: 18m43.92s
           Total system time: 00m07.76s
             Total user time: 00m06.57s
               Starting time: Tue Nov 29 15:52:44 2005
                 Ending time: Tue Nov 29 16:11:27 2005

dt (18334): 'unlink', errno = 5 - Input/output error
dt (18334): Error number 3 occurred on Tue Nov 29 16:11:27 2005

Comment 1 Ben Marzinski 2005-12-13 17:42:58 UTC
There was an osync bug that Kevin fixed earlier.  If you can't reproduce this
with the latest code, you might try pulling that fix out.  It's possible that
this is another side effect of that bug.

Comment 2 Robert Peterson 2005-12-21 19:19:08 UTC
I'd like to get some more information that will help me recreate this scenario
in our lab.  I was unable to recreate this problem using a three-node cluster,
i686 hardware and commands similar to this from my nfs client:

dt of=/mnt/bob/test.dt bs=512 passes=1 limit=1G log=/var/log/bob.dt.log

In every test I tried, my nfs server failed over to the next server successfully
and no errors were logged either by the dt tool or by the systems in the
cluster.  In each case, dt kept writing happily and my file size kept growing.
When my primary nfs server came back, it took over nfs serving again, also
without any errors reported by dt.

By the way, my cluster is using the latest cluster suite rpms including 
GFS-kernel-2.6.9-47.1.i686.rpm

I would like the following information:

1. I'd like a list of all RPMs on the server nodes and client nodes (i.e. output
from rpm -qa).  Please put it in an attachment.  One copy is okay if the nodes
are all basically the same.  I'm especially interested in rpm -q GFS-kernel but
I'd like to see the rest too.

2. I'd like to get the exact dt command used to recreate the problem.

3. A brief description of the hardware involved with this problem.  (For
example, Pentium 4 CPU 2.40GHz, 512MB Ram, Brocade fencing, etc.)

4. A brief description of the environment in which the problem occurred.  For
example, is dt writing a new file or is it overwriting an existing file?  Is it
copying from the -o sync mount to the non-sync mount or copying from the
non-sync mount to the -o sync mount?  Are there multiple dt's running on
multiple files?  Is the nfs server (or client) under a heavy workload when the
failure occurs?  And so on.

5. Description of how the primary nfs server node was brought down to create the
failure (Did you use the /sbin/reboot command and with what parameters? Did you
pull the plug?  Did you get a kernel panic?  Did you tell your power switch to
cut the power?  Did you do a really long kernel-only op like insmod?)

6. Anything else you think might help me recreate the problem in our lab.

Thanks.

Bob Peterson


Comment 3 Robert Peterson 2006-02-14 17:20:28 UTC
I'm fairly certain this bug is the same as bz 178057.  The use of the -o sync
option probably just changes the timing slightly but does not eliminate the
problem.  The IO errors reported by the dt tool are probably the result of the
NFS3ERR_ACCES described in comment 20 of bz 178057.


*** This bug has been marked as a duplicate of 178057 ***