Bug 272301

Summary:

Bad things happen when you attempt multiple flocks from a single process

Product:

[Retired] Red Hat Cluster Suite

Reporter:

Abhijith Das <adas>

Component:

GFS-kernel

Assignee:

Abhijith Das <adas>

Status:

CLOSED WONTFIX

QA Contact:

Cluster QE <mspqa-list>

Severity:

medium

Docs Contact:

Priority:

medium

Version:

CC:

anandab, kanderso, rkenna, teigland

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

All

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2008-04-14 20:28:25 UTC

Bug Blocks:

198302

Attachments:

Description	Flags
Program to create the problem.	none

Description Abhijith Das 2007-08-31 16:47:34 UTC

Description of problem:
The attached program flucker.c performs two flock operations on the same file
through two separate file descriptors. Changing the lock modes in the test
program gives four combinations. The results with GFS1 are:

Lock EX followed by Lock EX:
GFS: fsid=MyClusterToo:gfs.0: warning: assertion "!error" failed
GFS: fsid=MyClusterToo:gfs.0:   function = do_flock
GFS: fsid=MyClusterToo:gfs.0:   file = /home/devel/cluster/gfs-kernel/src/gfs/ops_file.c, line = 1678
GFS: fsid=MyClusterToo:gfs.0:   time = 1188578030
This assertion is from bug #198302

Lock EX followed by Lock SH:
GFS: fsid=MyClusterToo:gfs.0: warning: assertion "relaxed_state_ok(gl->gl_state, gh->gh_state, gh->gh_flags)" failed
GFS: fsid=MyClusterToo:gfs.0:   function = add_to_queue
GFS: fsid=MyClusterToo:gfs.0:   file = /home/devel/cluster/gfs-kernel/src/gfs/glock.c, line = 1413
GFS: fsid=MyClusterToo:gfs.0:   time = 1188578062

Lock SH followed by Lock EX:
GFS: fsid=MyClusterToo:gfs.0: warning: assertion "(tmp_gh->gh_flags & GL_LOCAL_EXCL) || !(gh->gh_flags & GL_LOCAL_EXCL)" failed
GFS: fsid=MyClusterToo:gfs.0:   function = add_to_queue
GFS: fsid=MyClusterToo:gfs.0:   file = /home/devel/cluster/gfs-kernel/src/gfs/glock.c, line = 1410
GFS: fsid=MyClusterToo:gfs.0:   time = 1188579430

Lock SH followed by Lock SH:
Works fine, but sometimes triggers an oops as a result of previous flock operations.

Version-Release number of selected component (if applicable):


How reproducible:
Most of the time.

Steps to Reproduce:
1. Compile the attached program and run it: ./flucker /mnt/gfs/foo
2. Check the console; it may show the assertions/oopses listed above.
3. Change the lock modes in flucker.c, recompile, and go to step 1.

ext3 behaves correctly for the above cases:
EX on EX - EAGAIN
EX on SH - EAGAIN
SH on EX - EAGAIN
SH on SH - allowed
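
For readers without access to the attachment, the four combinations above can be exercised with something like the following. This is a minimal sketch, not the attached flucker.c: the function names are hypothetical, and it assumes the second flock is made with LOCK_NB, which is what the EAGAIN results on ext3 suggest. On Linux, EWOULDBLOCK and EAGAIN are the same value.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Take first_mode on one descriptor, then try second_mode (non-blocking)
 * on a second, independently opened descriptor for the same file.
 * Returns 0 if the second lock was granted, the errno value if it was
 * refused, or -1 if the file could not be opened or the first lock failed.
 * (Hypothetical helper, not taken from flucker.c.)
 */
int second_flock_errno(const char *path, int first_mode, int second_mode)
{
    int result = -1;
    int fd1 = open(path, O_RDWR | O_CREAT, 0644);
    int fd2 = open(path, O_RDWR);

    if (fd1 < 0 || fd2 < 0)
        goto out;
    if (flock(fd1, first_mode) != 0)
        goto out;

    if (flock(fd2, second_mode | LOCK_NB) == 0)
        result = 0;
    else
        result = errno;   /* save before close() can overwrite it */

out:
    /* Closing the descriptors also drops any flocks they hold. */
    if (fd2 >= 0)
        close(fd2);
    if (fd1 >= 0)
        close(fd1);
    return result;
}

/* Print the outcome of all four EX/SH combinations for one file. */
void print_flock_matrix(const char *path)
{
    static const int modes[] = { LOCK_EX, LOCK_SH };
    static const char *names[] = { "EX", "SH" };

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            int rc = second_flock_errno(path, modes[i], modes[j]);
            printf("Lock %s followed by Lock %s: %s\n", names[i], names[j],
                   rc == 0 ? "allowed" :
                   rc < 0  ? "setup failed" : strerror(rc));
        }
    }
}
```

Calling print_flock_matrix() from a small main() with a path on the filesystem under test prints one line per combination. Running it against a GFS1 mount may trigger the assertions or oopses described in this report, so it should only be pointed at a test cluster.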

Comment 1 Abhijith Das 2007-08-31 16:47:34 UTC

Created attachment 183681
Program to create the problem.

Comment 3 Abhijith Das 2007-09-25 13:40:28 UTC

*** Bug 198302 has been marked as a duplicate of this bug. ***

Comment 4 Abhijith Das 2008-04-14 20:28:25 UTC

The real fix for this is quite invasive and might break the already fragile
flock code. There is an easy workaround to return -EAGAIN/-ENOSYS when a process
tries to flock the same file twice. But this workaround will mask the bug if it
ever appears in the field. If we find a real-world test-case that does
single-process-multiple-flocks, we can go after this one. Marking it WONTFIX.

Comment 5 Robert Clark 2008-04-29 16:29:45 UTC

We initially came across this bug trying to work out why the nodes in our live
cluster were occasionally rebooting. It turned out that one application had a
race condition when handling concurrent requests which would cause it to attempt
multiple locks on the same file. The result was kernel panics which were causing
the reboots:

Unable to handle kernel NULL pointer dereference at virtual address 0000000c 
 printing eip: 
82293ebf 
*pde = 00004001 
Oops: 0000 [#1] 
SMP  
Modules linked in: i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) ext3 jbd
dm_cmirror(U) dm_mirror dlm(U) cman(U) bonding(U) md5 ipv6 aoe(U) dm_mod button
battery ac uhci_hcd ehci_hcd tg3 sd_mod floppy ata_piix libata scsi_mod 
CPU:    0 
EIP:    0060:[<82293ebf>]    Not tainted VLI 
EFLAGS: 00010293   (2.6.9-67.0.7.ELhugemem)  
EIP is at add_to_queue+0x2c/0x27b [gfs] 
eax: 78a82030   ebx: 7767141c   ecx: 77671440   edx: 7fdfa524 
esi: 00000000   edi: 7fdfa4fc   ebp: 7fdfa4fc   esp: 70770eec 
ds: 007b   es: 007b   ss: 0068 
Process dod-upgrade-acc (pid: 10054, threadinfo=70770000 task=78a82030) 
Stack: 8222d000 7fdfa518 7767141c 8222d000 7fdfa4fc 822941d6 00000000 70904b88  
       00000000 00000480 7767141c 822a95a1 7767141c 00000001 70904b88 77671400  
       743107ec 704d5380 7fdfa4fc 80688500 70770f90 7836b8e0 021ad19a 70770f58  
Call Trace: 
 [<822941d6>] gfs_glock_nq+0xc8/0x116 [gfs] 
 [<822a95a1>] do_flock+0x111/0x182 [gfs] 
 [<021ad19a>] selinux_file_lock+0x7f/0x88 
 [<822a9673>] gfs_flock+0x0/0x76 [gfs] 
 [<0216e462>] sys_flock+0x96/0x120 
Code: 57 56 53 89 c3 51 8b 78 08 8b 87 9c 00 00 00 89 04 24 8b 43 0c 85 c0 0f 84
29 02 00 00 8b 77 28 8d 57 28 39 d6 0f 84 f6 00 00 00 <39> 46 0c 0f 85 e6 00 00
00 f6 43 14 08 75 2d f6 46 14 08 74 27  
 <0>Fatal exception: panic in 5 seconds 

Even without a full fix, it would be good to find a way to avoid this.
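
Until something changes in the kernel, one possible way to avoid the panic is for the application itself to refuse a second flock on a file it already holds. The following is only a sketch under assumptions: guarded_flock(), guarded_unlock() and the fixed-size table are hypothetical names for illustration, not part of GFS or any library. It is deliberately conservative (it refuses SH on SH as well), it is not thread-safe, and it assumes every lock is released through guarded_unlock() before the descriptor is closed.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_HELD 64   /* arbitrary table size for this sketch */

/* One entry per file this process currently holds a flock on. */
struct held_lock {
    dev_t dev;
    ino_t ino;
    int   fd;
    int   in_use;
};

static struct held_lock held[MAX_HELD];

/* Release the flock on fd and forget it in the table. */
int guarded_unlock(int fd)
{
    for (int i = 0; i < MAX_HELD; i++) {
        if (held[i].in_use && held[i].fd == fd)
            held[i].in_use = 0;
    }
    return flock(fd, LOCK_UN);
}

/*
 * flock() wrapper that fails with EAGAIN, without calling into the
 * filesystem, if this process already holds a flock on the same inode
 * through any descriptor. A multithreaded program would need to protect
 * the table with a mutex.
 */
int guarded_flock(int fd, int mode)
{
    struct stat st;
    int free_slot = -1;

    if ((mode & ~LOCK_NB) == LOCK_UN)
        return guarded_unlock(fd);

    if (fstat(fd, &st) != 0)
        return -1;

    for (int i = 0; i < MAX_HELD; i++) {
        if (!held[i].in_use) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (held[i].dev == st.st_dev && held[i].ino == st.st_ino) {
            errno = EAGAIN;   /* same file is already locked by us */
            return -1;
        }
    }

    if (free_slot < 0) {
        errno = ENOLCK;       /* table full */
        return -1;
    }
    if (flock(fd, mode) != 0)
        return -1;

    held[free_slot].dev = st.st_dev;
    held[free_slot].ino = st.st_ino;
    held[free_slot].fd = fd;
    held[free_slot].in_use = 1;
    return 0;
}
```

This mirrors the -EAGAIN workaround mentioned in comment 4, but in user space, so the second request never reaches do_flock. It does nothing for applications that cannot be changed.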