Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to the Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 515717

Summary: Flock on GFS fs file will error with "Resource temporarily unavailable" for EWOULDBLOCK
Product: Red Hat Enterprise Linux 5 Reporter: Shane Bradley <sbradley>
Component: gfs-kmod Assignee: Abhijith Das <adas>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: high Docs Contact:
Priority: high    
Version: 5.5 CC: adas, bmarzins, edamato, jwest, rwheeler, swhiteho, tao
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: gfs-kmod-0.1.34-11.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-03-30 08:56:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description / Flags
Reproducer script: none
attempt at a workaround patch: none

Description Shane Bradley 2009-08-05 13:44:54 UTC
Description of problem:

Customer is doing some testing on flocks with GFS. When requesting a
"flock" on a file in GFS in rw or ro mode, there will eventually be
errors of "Resource temporarily unavailable", which is error 11
(EWOULDBLOCK). It appears that a flock cannot be obtained on the file.

The flocks are requested with this line:
    int ret = flock(g_filehandle, LOCK_SH | LOCK_NB);
If LOCK_NB is not used, then no errors are produced.

The script that was originally submitted is attached to the bz.
See additional notes on other details.

This does not occur on GFS2.  This was also tested on the latest
release from rhn.redhat.com and still fails with errors.

Version-Release number of selected component (if applicable):
kmod-gfs-0.1.23-5.el5-x86_64 (tested on the latest from rhn.redhat.com
as well, and it fails)

How reproducible:
Every time

Steps to Reproduce:
1. Set up a GFS filesystem and mount it at /<mountpoint>
2. Download the reproducer script (gfs_trylock_test.cc) to the mount point
3. cd /<mountpoint>
4. Compile the script (see the top of the script for the compile command)
5. Run the resulting binary
  
Actual results:

Errors occur while trying to acquire a "flock" on a file. An error of
"Resource temporarily unavailable" is returned, which is error code
11 (EWOULDBLOCK).

The script's custom error message:
"flock: tryrdlock failed: handle=3 \
             error=Resource temporarily unavailable errno=11 threadId=4112
  tryacquire_read failed. loopN=7 threadId=4112"

Expected results:
That the script should complete with no errors.

Additional info:

I have tested this script with a couple of modifications (which are
not included): I removed the file creation, since creating the file
would open it in "rw" mode and thus give exclusive access with respect
to all the forked processes. I changed it so that it opens a file
descriptor on a read-only file. The error still occurred in this mode
as well.

Comment 1 Shane Bradley 2009-08-05 13:46:38 UTC
Created attachment 356316 [details]
Reproducer script

Comment 2 Robert Peterson 2009-08-05 14:18:50 UTC
I've been able to recreate this problem on GFS, and I've verified
that it does not recreate on GFS2.  My belief at this time is that
this was fixed in GFS2 by a patch that Abhi did that allowed glocks
for flocks to be shared.  Unfortunately, that GFS2 code has changed
a great deal since, so sorting it all out is a problem.  I don't
have a fix yet, but it should be fixable.

The problem also does not occur if mounted with -o localflocks, but
that's not normally sane in a clustered environment.
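For completeness, a localflocks mount entry might look like the following (the device and mount point are placeholders, not from this bug); note that localflocks makes flock()/POSIX locks node-local, so other cluster nodes are no longer excluded:

```
# /etc/fstab entry (illustrative): flocks handled locally, not cluster-wide
/dev/vg_cluster/gfslv  /mnt/gfs  gfs  defaults,localflocks  0 0
```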

Comment 4 Steve Whitehouse 2009-08-10 10:27:34 UTC
See also bz #421321

Comment 7 Steve Whitehouse 2009-08-25 09:56:09 UTC
Can you elaborate on the "massive changes"? The code looks pretty similar to me between gfs1 and gfs2. Am I looking at the wrong thing?

Comment 8 Abhijith Das 2010-01-11 16:40:38 UTC
Created attachment 383022 [details]
attempt at a workaround patch

I was able to reduce the parameters in the test script such that I could reproduce this bug with only a single process and two iterations of flock/unflock. It doesn't look like it is the same problem that was fixed in gfs2 where a process is queueing multiple flocks through multiple descriptors at the same time.

I've observed that this is a race between an unflock (LOCK_UN) and a subsequent flock. The unflock does a dq_uninit on the corresponding glock. When a flock request comes in before the glock can be uninitialized, it fails with -EAGAIN (when the request is non-blocking). This gives the impression that some other process is holding the flock, whereas it's the previous unflock by the same process that's preventing the flock from succeeding.

As soon as unflock returns, the user should immediately be able to flock again.

When the flock is blocking, it correctly waits for the glock uninitialization from the previous unflock and goes on to process the flock.

For a non-blocking flock request, this patch checks for the condition where a previous unflock-related glock uninitialization may be pending and if so, disregards the TRY flag.

This patch seems to work correctly... the script completes without any errors and the QA locksmith test also succeeds.

Comment 23 errata-xmlrpc 2010-03-30 08:56:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0291.html