Bug 143823 - [PATCH] Stale POSIX flock
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686 Linux
Priority: medium
Severity: high
Assigned To: Peter Staubach
Blocks: 156320
 
Reported: 2004-12-28 23:57 EST by Garik E
Modified: 2007-11-30 17:07 EST (History)
Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 10:40:29 EDT

Attachments
Patch to remove a stale POSIX flock (1.06 KB, patch)
2004-12-29 03:37 EST, Garik E
Testcase to reproduce the problem (3.32 KB, text/plain)
2005-06-01 11:03 EDT, Peter Staubach
Patch to remove locks when the close/fcntl race is detected (1.27 KB, patch)
2005-06-01 11:11 EDT, Peter Staubach
Patch to remove locks when the close/fcntl race is detected (1.27 KB, patch)
2005-06-01 11:12 EDT, Peter Staubach

Description Garik E 2004-12-28 23:57:44 EST
Description of problem:
Hello,

I've got a strange bug:
A POSIX lock is left in the system after the owning process terminates.
The lock also blocks all other processes that try to obtain a POSIX
flock on the same file. Removing the file does not release the lock.
An attempt to print /proc/locks crashes the system.
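
Since reading /proc/locks is reported to crash the affected kernel, one alternative way to check whether a file is still locked is to ask through fcntl(F_GETLK). The following is only a minimal sketch and not part of the original report; the function name probe_posix_lock is hypothetical. It asks whether a whole-file write lock would conflict with a POSIX lock held by another process. With a stale lock of the kind described above, the reported pid would presumably belong to a process that no longer exists, but that behaviour is an assumption and was not verified here.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the bug report).
 * Returns 1 and stores the holder's pid in *holder if another process
 * holds a conflicting POSIX lock on `path`, 0 if the file is free,
 * and -1 on error (errno is set).
 */
int probe_posix_lock(const char *path, pid_t *holder)
{
    struct flock fl = { 0 };
    int fd, rc, saved_errno;

    fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    fl.l_type = F_WRLCK;     /* test against an exclusive lock */
    fl.l_whence = SEEK_SET;  /* l_start = 0 and l_len = 0 cover the whole file */

    rc = fcntl(fd, F_GETLK, &fl);
    saved_errno = errno;
    /* Closing any descriptor for the file drops this process's own POSIX
     * locks on it, so call this only from a process that holds none. */
    close(fd);
    errno = saved_errno;

    if (rc < 0)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 0;            /* no conflicting lock found */
    if (holder)
        *holder = fl.l_pid;  /* pid recorded for the conflicting lock */
    return 1;
}
```

F_GETLK never reports locks held by the calling process itself, so the probe has to run in a process other than the suspected owner.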

After some debugging I found that filp->f_count at the end of the
sys_fcntl64 function is 1(???), and the subsequent call to fput releases
the file structure but leaves the POSIX lock.
I added a check for the FL_POSIX flag to locks_remove_flock and the
problem stopped.
This bug happens very often in my environment, but the user-mode setup is
a multi-process/multi-threaded application, so I have no clue how to
reduce it to a simple test program.


I work with enterprise linux 3, IBM xSeries345, Pentium4 3000 x2,  2GB RAM
Comment 1 Garik E 2004-12-29 03:37:20 EST
Created attachment 109163 [details]
Patch to remove a stale POSIX flock 

here is a patch I've used
Comment 3 Garik E 2005-01-04 00:08:49 EST
Hello,

Here is another issue for that bug:
If a process that is blocked by the stale POSIX lock gets SIGINT (Ctrl-C from
the terminal), it may crash. Here is the oops message:
CPU:    0                                                                       
EIP:    0060:[<8016582f>]    Tainted: PF                                        
EFLAGS: 00010282                                                                
                                                                                
EIP is at __fput [kernel] 0xf (2.4.21-20cpsmp/i686)                             
eax: f72ffc00   ebx: f72ffc00   ecx: 8667fce4   edx: f72ffc00                   
esi: 00000000   edi: f6337c80   ebp: 00000000   esp: f1e83f10                   
ds: 0068   es: 0068   ss: 0068                                                  
Process awk (pid: 9572, stackpage=f1e83000)                                     
Stack: 00000296 8667fce4 836fe580 f72ffc00 f6337c80 835801bc 80165817 8017ce72  
       836fe580 00000000 00000000 836fe5cc f6fb2e80 00000003 f6337c80 00000000  
       80163a97 f6fb2e80 f6337c80 00000001 00000003 f6337c80 00000001 8012d71c  
Call Trace:                                                                     
[<80165817>] fput [kernel] 0x17 (0xf1e83f28)                                    
[<8017ce72>] locks_remove_posix [kernel] 0x132 (0xf1e83f2c)                     
[<80163a97>] filp_close [kernel] 0x87 (0xf1e83f50)                              
[<8012d71c>] put_files_struct [kernel] 0x6c (0xf1e83f6c)                        
[<8012dfea>] do_exit [kernel] 0x1ba (0xf1e83f88)                                
[<8012e35b>] do_group_exit [kernel] 0x8b (0xf1e83fa4)                           
[<8012e3a3>] sys_exit_group [kernel] 0x13 (0xf1e83fb8)                          
                                                                                
Code:  8b 7d 08 89 04 24 e8 76 76 01 00 8b 4b 60 85 c9 0f 85 cd 00
Comment 4 Garik E 2005-01-04 00:22:39 EST
This issue is already resolved in 2.6, in revision 1.63 of fs/locks.c:
http://linux.bkbits.net:8080/linux-2.6/diffs/fs/locks.c@1.63?nav=index.html|src/|src/fs|hist/fs/locks.c


Comment 5 D. Michaud 2005-02-16 10:24:30 EST
We're seeing this same kernel crash. We have a multithreaded
application that uses file locking over NFS. Within 24 hours the
RHE3-u3-4 kernels will panic with similar traces as above, always
referencing "locks_remove_posix".

Applying the patch in comment #1 prevents the kernel panic, although
the error message added by the patch shows up frequently.

Request that an official patch be rolled into U5.
Comment 6 Garik E 2005-02-17 05:12:06 EST
The patch does not solve the problem. 
It is a recovery operation after file counter becomes inconsistent.
 
Comment 10 Peter Staubach 2005-05-11 11:13:29 EDT
Is there any known way to reproduce this situation?  A reproducible
testcase sure would be a help.
Comment 14 Garik E 2005-05-16 00:19:15 EDT
Unfortunately I don’t have a simple testcase for this crash. You can ask Linux
if he has one: he has fixed this bug in 2.6
Comment 15 Garik E 2005-05-16 00:21:45 EDT
... well, it's Linus.
My misspelling :)
Comment 16 Peter Staubach 2005-05-16 10:33:52 EDT
*** Bug 157846 has been marked as a duplicate of this bug. ***
Comment 17 Ernie Petrides 2005-05-24 18:46:22 EDT
Removing from U5 blocker list.
Comment 18 Garik E 2005-05-25 00:19:09 EDT
Well, it’s a shameful decision, but frankly, I wasn’t expecting anything else.
So, does somebody at RH support have the balls to admit that this bug won’t be
solved in the EL3 lifetime?
Comment 19 Ernie Petrides 2005-05-31 20:33:31 EDT
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.6.EL).
Comment 20 Peter Staubach 2005-06-01 11:03:07 EDT
Created attachment 115031 [details]
Testcase to reproduce the problem

This is a testcase which shows the dangling lock problem by reproducing
the fcntl/close race.
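
The attached testcase is not reproduced on this page. As a rough illustration only, the race could be provoked from user space along the following lines: one thread calls fcntl(F_SETLK) on a descriptor while a second thread closes the same descriptor, and a forked child then checks whether a POSIX lock was left on the file. This sketch is an assumption about the general shape of such a test, not the attachment itself; the names race_fcntl_close, lock_left_behind, locker and closer are hypothetical.

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread A: try to take a whole-file write lock on the shared descriptor. */
static void *locker(void *arg)
{
    struct flock fl = { 0 };
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;                  /* start 0, len 0: whole file */
    (void)fcntl(*(int *)arg, F_SETLK, &fl);  /* may succeed or fail with EBADF */
    return NULL;
}

/* Thread B: close the same descriptor concurrently. */
static void *closer(void *arg)
{
    close(*(int *)arg);
    return NULL;
}

/* Fork a child and let it ask, as a different process, whether a POSIX
 * lock is held on path. Returns 1 if so, 0 if not, -1 on error. */
static int lock_left_behind(const char *path)
{
    int status, code;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = { 0 };
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}

/* Run the race `rounds` times; return how many rounds left a lock on the
 * file although no descriptor for it remained open, or -1 on error. */
int race_fcntl_close(const char *path, int rounds)
{
    int stale = 0;

    for (int i = 0; i < rounds; i++) {
        pthread_t a, b;
        int r, fd = open(path, O_RDWR | O_CREAT, 0600);

        if (fd < 0)
            return -1;
        if (pthread_create(&a, NULL, locker, &fd) != 0) {
            close(fd);
            return -1;
        }
        if (pthread_create(&b, NULL, closer, &fd) != 0) {
            pthread_join(a, NULL);
            close(fd);
            return -1;
        }
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        r = lock_left_behind(path);
        if (r < 0)
            return -1;
        stale += r;
    }
    return stale;
}
```

On a kernel that handles the race, race_fcntl_close should return 0 regardless of how the two threads interleave. A nonzero count would indicate a dangling lock of the kind described in this bug. Because the window is timing dependent, a small number of rounds may not hit it even on an affected kernel.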
Comment 21 Peter Staubach 2005-06-01 11:11:39 EDT
Created attachment 115032 [details]
Patch to remove locks when the close/fcntl race is detected

These changes detect the fcntl/close race and correctly release the lock
which was just acquired.
Comment 22 Peter Staubach 2005-06-01 11:12:05 EDT
Created attachment 115033 [details]
Patch to remove locks when the close/fcntl race is detected

These changes detect the fcntl/close race and correctly release the lock
which was just acquired.
Comment 24 Peter Staubach 2005-07-21 15:58:06 EDT
*** Bug 157846 has been marked as a duplicate of this bug. ***
Comment 28 Red Hat Bugzilla 2005-09-28 10:40:29 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html
