Bug 212203 - cannot verify write when using fcntl locks on NFS on GFS
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: GFS-kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Robert Peterson
QA Contact: Cluster QE
Keywords: Reopened
Depends On:
Reported: 2006-10-25 13:05 EDT by Nate Straz
Modified: 2011-02-16 11:33 EST (History)

See Also:
Fixed In Version: GFS-kernel-2.6.9-86.1.el4
Doc Type: Bug Fix
Last Closed: 2011-02-16 11:33:43 EST

Attachments
Reproducer (1.18 KB, text/x-csrc)
2007-06-28 06:47 EDT, Sachin Prabhu
Patch to fix the problem (3.39 KB, patch)
2010-07-21 16:42 EDT, Robert Peterson

Description Nate Straz 2006-10-25 13:05:34 EDT
Description of problem:

When an fcntl lock is taken out on a file, a write, rewind, read sequence causes a short read.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a GFS file system
2. Export the file system via NFS
3. Mount the file system on an RHEL4 U4 client
4. Change directory to the mount point
5. Run `genesis -i 1 -n 1 -d 1 -L fcntl`

Actual results:

Test output:
creat_file(): Short read(3, buf, 1024) (nbytes is 0) on

strace output:
[pid  5300] mkdir("gendir_0", 0777)     = -1 EEXIST (File exists)
[pid  5300] open("gendir_0/rtaglameiunegnagjresqruf", O_RDWR|O_CREAT, 0666) = 3
[pid  5300] fcntl64(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid  5300] uname({sys="Linux", node="flea-04", ...}) = 0
[pid  5300] write(3, "C:5300:flea-04:genesis*C:5300:fl"..., 1024) = 1024
[pid  5300] lseek(3, 0, SEEK_SET)       = 0
[pid  5300] read(3, "", 1024)           = 0

Expected results:
The read() call should return 1024.
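
The syscall sequence in the strace output can be sketched in Python as follows. This is only an illustration, not the genesis source and not the attached reproducer; the function name `write_lock_verify` and the fill pattern are hypothetical. On a healthy file system the function returns the number of bytes written, while the failure reported here corresponds to a return value of 0.

```python
import fcntl
import os


def write_lock_verify(path, size=1024):
    """Take an fcntl write lock, write `size` bytes, rewind, and read them back.

    Returns the number of bytes returned by the read.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # lockf() uses fcntl()-style POSIX locks. LOCK_EX | LOCK_NB on the
        # whole file corresponds to F_SETLK with F_WRLCK, start=0, len=0.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        payload = b"x" * size  # placeholder data, not the genesis pattern
        written = os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, size)
        if len(data) != written:
            print(f"Short read: wrote {written}, read back {len(data)}")
        return len(data)
    finally:
        os.close(fd)  # closing the descriptor also drops the lock
```

To exercise the NFS path, call it with a file name under the NFS mount point, for example `write_lock_verify("testfile")` after changing into that directory.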

Additional info:

genesis is available from 
Comment 1 Nate Straz 2006-10-25 13:08:50 EDT
I probably should clone this since this affects multiple RHEL releases and file
systems.  I was able to verify the bug in the following client/server pairs:

RHEL4 U4 client talking to RHEL4 U4 server w/ GFS
RHEL4 U4 client talking to RHEL5 Beta2 server w/ ext3
RHEL4 U4 client talking to RHEL5 Beta2 server w/ GFS
Comment 4 Robert Peterson 2006-12-11 15:44:57 EST
I was able to recreate this problem on the trin-10/trin-11 cluster.
Comment 5 Dean Jansa 2006-12-13 12:26:38 EST
This looks related to fcntl() and NFS; if others disagree, we can open a new bug.

Running a RHEL5-Server-20061027.0 client and a RHEL5-Server-20061207.4 server, I
hit this on an exported ext3 fs:

Cmd run: 
 xiogen -S 8761 -i 30s -s read,write,readv,writev -m sequential -o -t 1b -T
1000b -F 1000b:lock1_small | xdoio -v -k -n 5

5 processes, each locking (fcntl() locks) each write request, rewind, verify.

It looks like they are seeing each other's data (pid 13709 seems to have been the
last to get its data out).

[lock1_small] iogen starting up with the following:
[lock1_small] Iterations:      30s
[lock1_small] Seed:            8761
[lock1_small] Offset-mode:     sequential
[lock1_small] Overlap Flag:    on
[lock1_small] Mintrans:        512
[lock1_small] Maxtrans:        512000
[lock1_small] Requests:        read,write
[lock1_small] Syscalls:        read,readv,write,writev
[lock1_small] IO Type:         buffered
[lock1_small] Test Devices:
[lock1_small] Path                                                      Size
[lock1_small]                                                         (bytes)
[lock1_small] ---------------------------------------------------------------
[lock1_small] lock1_small                                                     
[lock1_small] *** xdoio(pid: 13713) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  A:13713:flea-02:doio*A:13713:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
[lock1_small] *** xdoio(pid: 13711) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  J:13711:flea-02:doio*J:13711:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
[lock1_small] *** xdoio(pid: 13712) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  J:13712:flea-02:doio*J:13712:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
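
The lock, write, rewind, verify cycle run by the five processes can be sketched roughly as below. This is a simplified stand-in, not xdoio itself: the helper names are hypothetical, the whole file is locked rather than the range of each request, every worker writes at offset 0, and the `tag:pid:host:prog*` pattern layout is inferred from the expected bytes in the log. With working locks, no worker should ever read back another worker's pattern while it holds the lock.

```python
import fcntl
import os


def make_pattern(tag, pid, host, prog, size):
    """Build a repeating 'tag:pid:host:prog*' buffer of exactly `size` bytes."""
    unit = f"{tag}:{pid}:{host}:{prog}*".encode()
    return (unit * (size // len(unit) + 1))[:size]


def locked_write_verify(fd, pattern):
    """Write `pattern` at offset 0 under an fcntl write lock and re-read it.

    Returns True when the bytes read back equal the bytes written.
    """
    fcntl.lockf(fd, fcntl.LOCK_EX)  # blocking whole-file write lock (F_SETLKW)
    try:
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, pattern)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(pattern)) == pattern
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)


def run_workers(path, nproc=5, rounds=100, size=512):
    """Fork `nproc` children that each loop over locked_write_verify().

    Returns the number of children that saw a data mismatch.
    """
    children = []
    for i in range(nproc):
        pid = os.fork()
        if pid == 0:
            # Child: fcntl locks are per process, so each worker is a process.
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
            tag = chr(ord("A") + i)
            pattern = make_pattern(tag, os.getpid(), os.uname().nodename,
                                   "doio", size)
            ok = all(locked_write_verify(fd, pattern) for _ in range(rounds))
            os._exit(0 if ok else 1)
        children.append(pid)
    failures = 0
    for pid in children:
        _, status = os.waitpid(pid, 0)
        if status != 0:
            failures += 1
    return failures
```

Calling `run_workers("lock1_small")` from a directory on the NFS mount and checking for a nonzero return value approximates the comparison errors shown in the log.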
Comment 6 Robert Peterson 2006-12-19 14:02:08 EST
I was able to recreate this very easily using genesis with the
instructions in the problem description, using nfs and ext3.
I also verified that genesis ran fine on a local file system.
Since this is outside all the cluster code, I'm reassigning to Steve
Dickson and NFS.  By the way, I didn't get any errors when I ran the
xiogen command from comment #5.
Comment 7 Nate Straz 2007-01-29 11:30:49 EST
Is there a reason this was moved to 4.6?
Comment 8 Peter Staubach 2007-01-29 14:29:02 EST
Yes.  It will not be addressed in 4.5.
Comment 10 Dean Jansa 2007-03-06 17:03:44 EST
I managed to hit what looks to be another variant of this issue, this time with
flock rather than fcntl locking: a RHEL5 client running against a RHEL4.5 server
exporting a GFS file system:

# On client
flea-02 $:  /usr/tests/sts/bin/genesis -S 4798 -i 30s -n 100 -d 10 -p 5  -L flock -k

*** DATA COMPARISON ERROR gendir_9/vwvaawfokmmlsjbhnxogdbqu ***
Corrupt regions follow - unprintable chars are represented as '.'
corrupt bytes starting at file offset 0
    1st 32 expected bytes:  A:2368:flea-02:genesis*A:2368:fl
    1st 32 actual bytes:    ................................

Looks like the re-read of the data is giving us back a buffer of zeros (hence
the '.' in the actual byte output).
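
For comparison with the fcntl case, here is a rough sketch of the same write, rewind, verify check under a flock() lock, together with a report in the style of the output above. The helper names are hypothetical and this is not the genesis code; in particular, printing the 32 bytes starting at the first corrupt offset is an assumption, since the log only shows a mismatch at offset 0.

```python
import fcntl
import os


def printable(buf):
    """Render bytes as text, replacing unprintable bytes with '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in buf)


def first_corrupt_offset(expected, actual):
    """Return the first offset where the buffers differ, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def flock_write_verify(path, pattern):
    """Write `pattern` under a flock() lock, re-read it, and report mismatches.

    Returns None when the data verifies, otherwise a short report string.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # flock()-style lock, not fcntl()
        os.write(fd, pattern)
        os.lseek(fd, 0, os.SEEK_SET)
        actual = os.read(fd, len(pattern))
    finally:
        os.close(fd)  # closing the descriptor releases the flock
    offset = first_corrupt_offset(pattern, actual)
    if offset is None:
        return None
    return (
        f"corrupt bytes starting at file offset {offset}\n"
        f"    1st 32 expected bytes:  {printable(pattern[offset:offset + 32])}\n"
        f"    1st 32 actual bytes:    {printable(actual[offset:offset + 32])}"
    )
```

A buffer of zeros read back from the server would show up in this report as a row of '.' characters, matching the actual bytes line above.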

Comment 11 Dean Jansa 2007-03-06 17:07:05 EST
I should note that the issue in comment 10 was seen on an ia64 cluster. The same
case passed on an x86 cluster.

Comment 13 Nate Straz 2007-05-14 16:26:34 EDT
I seem to be hitting this between a RHEL5 Server GOLD client and a RHEL5 Server
errata (kernel-2.6.18-8.1.4.el5) server.
Comment 14 Sachin Prabhu 2007-06-28 06:44:58 EDT
A similar issue was seen on a RHEL 4 U4 NFS client. The problem can be seen with
the attached reproducer on the RHEL 4 U4 kernel. The reproducer takes one
argument: the path to a file that does not already exist on an NFS mount.

We were not able to reproduce the issue with the U5 kernel. On closer
inspection, it appears that the problem was fixed by a patch in the 2.6.9-42.8.EL
kernel, in all probability the patch for bz 186142.
Comment 15 Sachin Prabhu 2007-06-28 06:47:13 EDT
Created attachment 158107
Comment 16 RHEL Product and Program Management 2007-09-07 15:39:17 EDT
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.
Comment 18 Dean Jansa 2008-07-17 15:00:16 EDT
FWIW --  I hit this again, in the same way as comment #6, while running 4.7
server and RHEL5-Server-GOLD client.
Comment 22 Robert Peterson 2010-06-29 12:15:51 EDT
Not sure why this got bounced to Ric, but that's obviously
wrong.  Taking the bug for now.  It's almost certainly a
duplicate of 245024, but perhaps we can test that theory.
Comment 23 Robert Peterson 2010-06-29 12:22:23 EDT
Since this is RHEL4, I'll have to back-port the fix from 245024.
The question then becomes: Do we also need to port Jeff Layton's
co-requisite NFS patch as well?  Or is it already fixed in RHEL4?
Nate: I take it you want to see us push a RHEL4 patch for this,
once we verify it fixes the reported problem?
Comment 24 Jeff Layton 2010-06-29 12:44:46 EDT
A quick glance shows that this problem is likely present in RHEL4 too. That said, the patch in question is an NFS client-side patch and only matters for the NFS client in the test setup. It'll have no bearing on the server side (and thus shouldn't block the GFS fix).
Comment 25 Robert Peterson 2010-07-21 16:39:47 EDT
I back-ported the patch to RHEL4 and tested it on roth-07.
Setting some ack flags to get this into 4.9.  I'll post the
Comment 26 RHEL Product and Program Management 2010-07-21 16:40:18 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
Comment 27 Robert Peterson 2010-07-21 16:42:00 EDT
Created attachment 433497
Patch to fix the problem

This is the crosswrite patch I tested.
Comment 28 Robert Peterson 2010-07-21 16:53:18 EDT
I pushed the patch to the RHEL4 and RHEL49 branches of the cluster
git tree for inclusion into 4.9.  It was tested on system roth-07
and roth-01.  The patch was back-ported from RHEL5.6 so there are
no upstream requirements.  Changing status to POST until it gets
Comment 29 Robert Peterson 2010-08-31 12:21:45 EDT
Chris Feist does the RHEL4 builds; adding him to the cc list
to help ensure this gets built for 4.9.
Comment 30 Robert Peterson 2010-09-23 11:58:25 EDT
Changing component from kernel to GFS-kernel.
Comment 32 errata-xmlrpc 2011-02-16 11:33:43 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

