212203 – cannot verify write when using fcntl locks on NFS on GFS

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	GFS-kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Assignee:	Robert Peterson
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2006-10-25 17:05 UTC by Nate Straz
Modified:	2018-10-19 22:38 UTC
CC List:	12 users
Fixed In Version:	GFS-kernel-2.6.9-86.1.el4
Clone Of:
Environment:
Last Closed:	2011-02-16 16:33:43 UTC
Embargoed:

Attachments
Reproducer (1.18 KB, text/x-csrc) 2007-06-28 10:47 UTC, Sachin Prabhu
Patch to fix the problem (3.39 KB, patch) 2010-07-21 20:42 UTC, Robert Peterson

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0276	0	normal	SHIPPED_LIVE	GFS-kernel bug-fix update	2011-02-16 16:33:34 UTC

Description Nate Straz 2006-10-25 17:05:34 UTC

Description of problem:

When an fcntl lock is taken out on a file, a write, rewind, read sequence
results in a short read.


Version-Release number of selected component (if applicable):
kernel-2.6.9-42.0.3.EL

How reproducible:
100%

Steps to Reproduce:
1. Create a GFS file system
2. Export the file system via NFS
3. Mount the file system on a RHEL4 U4 client
4. Change directory to the mount point
5. Run `genesis -i 1 -n 1 -d 1 -L fcntl`

Actual results:

Test output:
creat_file(): Short read(3, buf, 1024) (nbytes is 0) on
gendir_0/rtaglameiunegnagjresqruf

strace output:
[pid  5300] mkdir("gendir_0", 0777)     = -1 EEXIST (File exists)
[pid  5300] open("gendir_0/rtaglameiunegnagjresqruf", O_RDWR|O_CREAT, 0666) = 3
[pid  5300] fcntl64(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}) = 0
[pid  5300] uname({sys="Linux", node="flea-04", ...}) = 0
[pid  5300] write(3, "C:5300:flea-04:genesis*C:5300:fl"..., 1024) = 1024
[pid  5300] lseek(3, 0, SEEK_SET)       = 0
[pid  5300] read(3, "", 1024)           = 0

Expected results:
The read() call should return 1024.
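
The sequence in the strace output above can be sketched as a small standalone check. This is a minimal Python sketch, not the genesis source: the function name `write_rewind_read` and the fill byte are hypothetical, and `fcntl.lockf` is used because it is a wrapper around fcntl() record locks. On a local file system the read should return all 1024 bytes; on an affected NFS mount it would be expected to come back short.

```python
import fcntl
import os


def write_rewind_read(path, size=1024):
    """Lock the file, write `size` bytes, rewind, and read them back.

    Returns (bytes_written, data_read) so the caller can compare lengths.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        # Non-blocking exclusive lock over the whole file (start=0, len=0),
        # analogous to fcntl64(fd, F_SETLK, {type=F_WRLCK, ...}) in the trace.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 0, 0, os.SEEK_SET)
        payload = b"C" * size  # placeholder data, not the genesis pattern
        written = os.write(fd, payload)
        os.lseek(fd, 0, os.SEEK_SET)  # rewind
        data = os.read(fd, size)      # the call that returned 0 in the report
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return written, data
    finally:
        os.close(fd)
```

Calling `write_rewind_read("/mnt/nfs/probe")` (the path is only an example) and comparing `len(data)` with `written` gives the same pass/fail signal as the "Short read" message.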

Additional info:

genesis is available from 
http://sts.lab.msp.redhat.com/svn/viewvc.py/sistina-test/trunk/src/genesis/genesis.c

Comment 1 Nate Straz 2006-10-25 17:08:50 UTC

I probably should clone this since this affects multiple RHEL releases and file
systems.  I was able to verify the bug in the following client/server pairs:

RHEL4 U4 client talking to RHEL4 U4 server w/ GFS
RHEL4 U4 client talking to RHEL5 Beta2 server w/ ext3
RHEL4 U4 client talking to RHEL5 Beta2 server w/ GFS

Comment 4 Robert Peterson 2006-12-11 20:44:57 UTC

I was able to recreate this problem on the trin-10/trin-11 cluster.

Comment 5 Dean Jansa 2006-12-13 17:26:38 UTC

This looks related to fcntl() and NFS; if others disagree, we can open a new bug.

Running a RHEL5-Server-20061027.0 client and a RHEL5-Server-20061207.4 server, I
hit this on an exported ext3 fs:

Cmd run: 
 xiogen -S 8761 -i 30s -s read,write,readv,writev -m sequential -o -t 1b -T 1000b -F 1000b:lock1_small | xdoio -v -k -n 5

Five processes, each taking an fcntl() lock around each write request, then
rewinding and verifying.

It looks like they are seeing each other's data (pid 13709 seems to have been the
last to get its data out).



[lock1_small] iogen starting up with the following:
[lock1_small] 
[lock1_small] Iterations:      30s
[lock1_small] Seed:            8761
[lock1_small] Offset-mode:     sequential
[lock1_small] Overlap Flag:    on
[lock1_small] Mintrans:        512
[lock1_small] Maxtrans:        512000
[lock1_small] Requests:        read,write
[lock1_small] Syscalls:        read,readv,write,writev
[lock1_small] IO Type:         buffered
[lock1_small] 
[lock1_small] Test Devices:
[lock1_small] 
[lock1_small] Path                                                      Size
[lock1_small]                                                         (bytes)
[lock1_small] ---------------------------------------------------------------
[lock1_small] lock1_small                                               512000
[lock1_small] *** xdoio(pid: 13713) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  A:13713:flea-02:doio*A:13713:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
[lock1_small] 
[lock1_small] *** xdoio(pid: 13711) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  J:13711:flea-02:doio*J:13711:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
[lock1_small] 
[lock1_small] *** xdoio(pid: 13712) DATA COMPARISON ERROR lock1_small ***
[lock1_small] Corrupt regions follow - unprintable chars are represented as '.'
[lock1_small] -----------------------------------------------------------------
[lock1_small] corrupt bytes starting at file offset 0
[lock1_small]     1st 32 expected bytes:  J:13712:flea-02:doio*J:13712:fle
[lock1_small]     1st 32 actual bytes:    E:13709:flea-02:doio*E:13709:fle
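
The comparison reports above can be read as a pattern check: each writer fills its region with a repeating tag, and the verifier reports the first offset where the re-read data differs. The following is a rough Python sketch of that idea, not the xdoio source. The `tag:pid:host:program*` layout is inferred only from the "expected bytes" lines, and all function names are hypothetical.

```python
def make_pattern(tag, pid, host, prog, length):
    """Build `length` bytes of a repeating 'tag:pid:host:prog*' unit."""
    unit = f"{tag}:{pid}:{host}:{prog}*".encode("ascii")
    reps = length // len(unit) + 1
    return (unit * reps)[:length]


def first_corrupt_offset(expected, actual):
    """Return the first offset where the buffers differ, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One buffer is a strict prefix of the other (e.g. a short read).
        return min(len(expected), len(actual))
    return None


def printable(chunk):
    """Render bytes for a report, showing unprintable characters as '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
```

With these helpers, `make_pattern("A", 13713, "flea-02", "doio", 32)` yields the same 32 bytes as the first "expected bytes" line, and comparing it against the pattern from pid 13709 reports offset 0, matching the output above.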

Comment 6 Robert Peterson 2006-12-19 19:02:08 UTC

I was able to recreate this very easily using genesis with the
instructions in the problem description, using nfs and ext3.
I also verified that genesis ran fine on a local file system.
Since this is outside all the cluster code, I'm reassigning to Steve
Dickson and NFS.  By the way, I didn't get any errors when I ran the
xiogen command from comment #5.

Comment 7 Nate Straz 2007-01-29 16:30:49 UTC

Is there a reason this was moved to 4.6?

Comment 8 Peter Staubach 2007-01-29 19:29:02 UTC

Yes.  It will not be addressed in 4.5.

Comment 10 Dean Jansa 2007-03-06 22:03:44 UTC

I managed to hit what looks to be another variant of this issue, this time with
flock rather than fcntl locking. The setup was a RHEL5 client running against a
RHEL4.5 server exporting a GFS filesystem:

# On client
flea-02 $:  /usr/tests/sts/bin/genesis -S 4798 -i 30s -n 100 -d 10 -p 5  -L flock -k

Results:
*** DATA COMPARISON ERROR gendir_9/vwvaawfokmmlsjbhnxogdbqu ***
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 0
    1st 32 expected bytes:  A:2368:flea-02:genesis*A:2368:fl
    1st 32 actual bytes:    ................................


Looks like the re-read of the data is giving us back a buffer of zeros (hence
the '.' in the actual byte output).
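
For the flock case, the same write, rewind, re-read check can be sketched with a whole-file flock() lock, plus a test for the all-zero buffer described here. This is a minimal Python sketch under stated assumptions, not what genesis does internally: `flock_write_reread` and `is_all_zero` are hypothetical names.

```python
import fcntl
import os


def flock_write_reread(path, payload):
    """Write `payload` under an flock() lock, rewind, and return the re-read data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # whole-file advisory lock
        try:
            os.write(fd, payload)
            os.lseek(fd, 0, os.SEEK_SET)
            return os.read(fd, len(payload))
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def is_all_zero(buf):
    """True when the buffer is non-empty and contains only NUL bytes."""
    return len(buf) > 0 and not any(buf)
```

A re-read for which `is_all_zero` returns True would correspond to the row of '.' characters in the "actual bytes" line, as opposed to a short read or another writer's pattern.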

Comment 11 Dean Jansa 2007-03-06 22:07:05 UTC

I should note that the issue in comment 10 was seen on an ia64 cluster.  The same
case passed on an x86 cluster.

Comment 13 Nate Straz 2007-05-14 20:26:34 UTC

I seem to be hitting this between a RHEL5 Server GOLD client and a RHEL5 Server
errata (kernel-2.6.18-8.1.4.el5) server.

Comment 14 Sachin Prabhu 2007-06-28 10:44:58 UTC

A similar issue was seen on a RHEL 4 U4 NFS client. The problem can be seen with
the attached reproducer on the RHEL 4 U4 kernel. The reproducer takes one
argument: the path to a file that does not already exist on an NFS mount.

We were not able to reproduce the issue with the U5 kernel. On closer
inspection, it appears that the problem was fixed in the 2.6.9-42.8.EL kernel,
in all probability by the patch for bz 186142.

Comment 15 Sachin Prabhu 2007-06-28 10:47:13 UTC

Created attachment 158107
Reproducer

Comment 16 RHEL Program Management 2007-09-07 19:39:17 UTC

This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 18 Dean Jansa 2008-07-17 19:00:16 UTC

FWIW --  I hit this again, in the same way as comment #6, while running 4.7
server and RHEL5-Server-GOLD client.

Comment 22 Robert Peterson 2010-06-29 16:15:51 UTC

Not sure why this got bounced to Ric, but that's obviously
wrong.  Taking the bug for now.  It's almost certainly a
duplicate of 245024, but perhaps we can test that theory.

Comment 23 Robert Peterson 2010-06-29 16:22:23 UTC

Since this is RHEL4, I'll have to back-port the fix from 245024.
The question then becomes: Do we also need to port Jeff Layton's
co-requisite NFS patch as well?  Or is it already fixed in RHEL4?
Nate: I take it you want to see us push a RHEL4 patch for this,
once we verify it fixes the reported problem?

Comment 24 Jeff Layton 2010-06-29 16:44:46 UTC

A quick glance shows that this problem is likely present in RHEL4 too. That said, the patch in question is an NFS client-side patch and only matters for the NFS client in the test setup. It'll have no bearing on the server side (and thus shouldn't block the GFS fix).

Comment 25 Robert Peterson 2010-07-21 20:39:47 UTC

I back-ported the patch to RHEL4 and tested it on roth-07.
Setting some ack flags to get this into 4.9.  I'll post the
patch.

Comment 26 RHEL Program Management 2010-07-21 20:40:18 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 27 Robert Peterson 2010-07-21 20:42:00 UTC

Created attachment 433497
Patch to fix the problem

This is the crosswrite patch I tested.

Comment 28 Robert Peterson 2010-07-21 20:53:18 UTC

I pushed the patch to the RHEL4 and RHEL49 branches of the cluster
git tree for inclusion into 4.9.  It was tested on system roth-07
and roth-01.  The patch was back-ported from RHEL5.6 so there are
no upstream requirements.  Changing status to POST until it gets
built.

Comment 29 Robert Peterson 2010-08-31 16:21:45 UTC

Chris Feist does the RHEL4 builds; adding him to the cc list
to help ensure this gets built for 4.9.

Comment 30 Robert Peterson 2010-09-23 15:58:25 UTC

Changing component from kernel to GFS-kernel.

Comment 32 errata-xmlrpc 2011-02-16 16:33:43 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0276.html
