Bug 515717 - Flock on GFS fs file will error with "Resource temporarily unavailable" for EWOULDBLOCK
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: gfs-kmod
Version: 5.5
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: Abhijith Das
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-08-05 13:44 UTC by Shane Bradley
Modified: 2018-10-27 14:54 UTC

Fixed In Version: gfs-kmod-0.1.34-11.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 08:56:04 UTC
Target Upstream Version:
Embargoed:


Attachments
Reproducer script (2.81 KB, application/octet-stream)
2009-08-05 13:46 UTC, Shane Bradley
attempt at a workaround patch (2.20 KB, patch)
2010-01-11 16:40 UTC, Abhijith Das


Links
System: Red Hat Product Errata
ID: RHSA-2010:0291
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: gfs-kmod security, bug fix and enhancement update
Last Updated: 2010-03-29 14:12:22 UTC

Description Shane Bradley 2009-08-05 13:44:54 UTC
Description of problem:

Customer is doing some testing on flocks with GFS. When requesting a
"flock" on a file in GFS in rw or ro mode, there will eventually be
errors stating "Resource temporarily unavailable", which is error 11
(EWOULDBLOCK). It appears that a flock cannot be obtained on the file.

The flocks are requested with this line:
   int ret = flock(g_filehandle, LOCK_SH | LOCK_NB);
If LOCK_NB is not used, then no errors are produced.
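
For reference, a minimal sketch of how a non-blocking shared flock is usually checked. The helper name try_shared_lock is hypothetical and is not taken from the attached reproducer; it only illustrates how EWOULDBLOCK is told apart from other failures.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>

/* Hypothetical helper, not part of the attached reproducer.
 * Returns 0 if the shared lock was acquired, 1 if the lock is
 * reported as busy (EWOULDBLOCK), and -1 on any other error. */
static int try_shared_lock(int fd)
{
    if (flock(fd, LOCK_SH | LOCK_NB) == 0)
        return 0;

    int err = errno; /* save errno before calling anything else */
    if (err == EWOULDBLOCK) {
        /* On Linux EWOULDBLOCK has the same value as EAGAIN (11). */
        fprintf(stderr, "flock: would block: %s (errno=%d)\n",
                strerror(err), err);
        return 1;
    }

    fprintf(stderr, "flock: %s (errno=%d)\n", strerror(err), err);
    return -1;
}
```

A return value of 1 normally means a conflicting lock is held through another open file description. In this report it is seen even when no such holder appears to exist.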

The script as originally submitted is attached to this bz.
See the additional info below for other details.

This does not occur on GFS2.  The latest release from rhn.redhat.com
was also tested and still fails with errors.

Version-Release number of selected component (if applicable):
kmod-gfs-0.1.23-5.el5-x86_64 ( tested on latest from rhn.redhat.com as
well and it fails)

How reproducible:
Every time

Steps to Reproduce:
1. Set up a GFS fs and mount it at /<mountpoint>
2. Download the reproducer script (gfs_trylock_test.cc) to the mount point
3. cd /<mountpoint>
4. Compile the script (see the top of the script for the compile command)
5. Run the binary built from gfs_trylock_test.cc
  
Actual results:

Errors occur when trying to acquire a "flock" on a file. The error
"Resource temporarily unavailable" is returned, which is error code
11 (EWOULDBLOCK).

Script's custom error message:
"flock: tryrdlock failed: handle=3 \
             error=Resource temporarily unavailable errno=11 threadId=4112
  tryacquire_read failed. loopN=7 threadId=4112"

Expected results:
The script should complete with no errors.

Additional info:

I have tested this script with a couple of modifications (which are
not included). I removed the file creation, since the file creation
would open the file in "rw" mode, and thus with exclusive access, for
all the forked processes. I changed it so that it would open the fd on
a read-only file. The error still occurred in this mode as well.

Comment 1 Shane Bradley 2009-08-05 13:46:38 UTC
Created attachment 356316 [details]
Reproducer script

Comment 2 Robert Peterson 2009-08-05 14:18:50 UTC
I've been able to recreate this problem on GFS, and I've verified
that it does not recreate on GFS2.  My belief at this time is that
this was fixed in GFS2 by a patch that Abhi did that allowed glocks
for flocks to be shared.  Unfortunately, that GFS2 code has changed
a great deal since, so sorting it all out is a problem.  I don't
have a fix yet, but it should be fixable.

The problem also does not occur if mounted with -o localflocks but
that's not normally sane in a clustered environment.

Comment 4 Steve Whitehouse 2009-08-10 10:27:34 UTC
See also bz #421321

Comment 7 Steve Whitehouse 2009-08-25 09:56:09 UTC
Can you elaborate on the "massive changes" as the code looks pretty similar to me between gfs1 and gfs2. Am I looking at the wrong thing?

Comment 8 Abhijith Das 2010-01-11 16:40:38 UTC
Created attachment 383022 [details]
attempt at a workaround patch

I was able to reduce the parameters in the test script such that I could reproduce this bug with only a single process and two iterations of flock/unflock. It doesn't look like it is the same problem that was fixed in gfs2 where a process is queueing multiple flocks through multiple descriptors at the same time.

I've observed that this is a race between an unflock (LOCK_UN) and a subsequent flock. The unflock does a dq_uninit on the corresponding glock. When a flock request comes in before the glock can be uninitialized, it fails with -EAGAIN (when the request is non-blocking). This gives the impression that some other process is holding the flock, whereas it's the previous unflock by the same process that's preventing the flock from succeeding.

As soon as unflock returns, the user should immediately be able to flock again.
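
A minimal sketch of that expectation, assuming a single process and a single descriptor. This is a hypothetical reduction, not the attached script; the function name count_ewouldblock and its parameters are made up for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical reduction, not the attached reproducer: one process
 * repeatedly takes and drops a non-blocking shared flock on the same
 * descriptor. Returns the number of iterations in which flock()
 * failed with EWOULDBLOCK, or -1 on any other error. */
static int count_ewouldblock(const char *path, int iterations)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int failures = 0;
    for (int i = 0; i < iterations; i++) {
        if (flock(fd, LOCK_SH | LOCK_NB) != 0) {
            if (errno != EWOULDBLOCK) {
                close(fd);
                return -1;
            }
            failures++;     /* lock was not taken, nothing to unlock */
            continue;
        }
        if (flock(fd, LOCK_UN) != 0) {
            close(fd);
            return -1;
        }
    }

    close(fd);
    return failures;
}
```

With no other lock holders, the expected return value is 0 for any iteration count. A nonzero count for a file on a GFS mount would correspond to the behavior described in this comment.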

When the flock is blocking, it correctly waits for the glock uninitialization from the previous unflock and goes on to process the flock.

For a non-blocking flock request, this patch checks for the condition where a previous unflock-related glock uninitialization may be pending and if so, disregards the TRY flag.
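
The attached patch is not reproduced here. As a rough illustration of the decision described above, here is a toy userspace model. Every name in it (toy_glock, unflock_pending, TOY_FLAG_TRY, toy_effective_flags) is hypothetical and does not correspond to actual GFS kernel identifiers.

```c
#include <stdbool.h>

/* All names below are hypothetical and exist only for this sketch;
 * they are not GFS kernel identifiers. */
#define TOY_FLAG_TRY 0x1u       /* request is non-blocking (LOCK_NB) */

struct toy_glock {
    bool unflock_pending;       /* a previous LOCK_UN has not finished */
};

/* Returns the flags the lock request would be queued with in this
 * toy model: the TRY flag is dropped only while an unflock is still
 * pending, so the request waits instead of failing immediately. */
static unsigned toy_effective_flags(const struct toy_glock *gl,
                                    unsigned flags)
{
    if ((flags & TOY_FLAG_TRY) && gl->unflock_pending)
        return flags & ~TOY_FLAG_TRY;
    return flags;
}
```

In this model a non-blocking request keeps its TRY flag whenever no unflock is pending, so a lock genuinely held elsewhere would still be reported as busy.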

This patch seems to work correctly... the script completes without any errors and the QA locksmith test also succeeds.

Comment 23 errata-xmlrpc 2010-03-30 08:56:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0291.html

