Description of problem:

The attached program flucker.c does two flock operations on the same file through two file descriptors. You can change the lock modes in the test program to exercise all four combinations. The results with GFS1 are:

Lock EX followed by Lock EX:

GFS: fsid=MyClusterToo:gfs.0: warning: assertion "!error" failed
GFS: fsid=MyClusterToo:gfs.0: function = do_flock
GFS: fsid=MyClusterToo:gfs.0: file = /home/devel/cluster/gfs-kernel/src/gfs/ops_file.c, line = 1678
GFS: fsid=MyClusterToo:gfs.0: time = 1188578030

This assertion is from bug #198302.

Lock EX followed by Lock SH:

GFS: fsid=MyClusterToo:gfs.0: warning: assertion "relaxed_state_ok(gl->gl_state, gh->gh_state, gh->gh_flags)" failed
GFS: fsid=MyClusterToo:gfs.0: function = add_to_queue
GFS: fsid=MyClusterToo:gfs.0: file = /home/devel/cluster/gfs-kernel/src/gfs/glock.c, line = 1413
GFS: fsid=MyClusterToo:gfs.0: time = 1188578062

Lock SH followed by Lock EX:

GFS: fsid=MyClusterToo:gfs.0: warning: assertion "(tmp_gh->gh_flags & GL_LOCAL_EXCL) || !(gh->gh_flags & GL_LOCAL_EXCL)" failed
GFS: fsid=MyClusterToo:gfs.0: function = add_to_queue
GFS: fsid=MyClusterToo:gfs.0: file = /home/devel/cluster/gfs-kernel/src/gfs/glock.c, line = 1410
GFS: fsid=MyClusterToo:gfs.0: time = 1188579430

Lock SH followed by Lock SH:

Works fine, although the node sometimes breaks into an oops as a result of the previous flock operations.

Version-Release number of selected component (if applicable):

How reproducible:
Most of the time.

Steps to Reproduce:
1. Run the attached program: ./flucker /mnt/gfs/foo
2. The console may show the assertions/oopses above.
3. Change the lock modes in flucker.c, recompile, and go back to step 1.

For comparison, ext3 behaves correctly for the above cases:
EX on EX - EAGAIN
EX on SH - EAGAIN
SH on EX - EAGAIN
SH on SH - allowed
Created attachment 183681 [details] Program to create the problem.
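For reference, a rough sketch of what the reproducer does (this is only an approximation of the attached flucker.c, not the attachment itself; the exact lock modes and error handling may differ):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(int argc, char **argv)
{
        int fd1, fd2;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        /* Open the same file twice so we get two file descriptors. */
        fd1 = open(argv[1], O_RDWR | O_CREAT, 0644);
        fd2 = open(argv[1], O_RDWR);
        if (fd1 < 0 || fd2 < 0) {
                perror("open");
                return 1;
        }

        /* First flock through fd1; change LOCK_EX/LOCK_SH here and below
           to exercise the four combinations described above. */
        if (flock(fd1, LOCK_EX) < 0)
                perror("flock fd1");

        /* Second flock on the same file through fd2, from the same process.
           LOCK_NB so a conflicting mode returns EAGAIN instead of blocking. */
        if (flock(fd2, LOCK_EX | LOCK_NB) < 0)
                perror("flock fd2");

        close(fd2);
        close(fd1);
        return 0;
}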
*** Bug 198302 has been marked as a duplicate of this bug. ***
The real fix for this is quite invasive and might break the already fragile flock code. There is an easy workaround: return -EAGAIN/-ENOSYS when a process tries to flock the same file twice. However, that workaround would mask the bug if it ever shows up in the field. If we find a real-world test case that does single-process multiple flocks, we can go after this one. Marking it WONTFIX.
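To make the trade-off concrete: the ext3 table above allows SH on SH, but a blanket -EAGAIN/-ENOSYS workaround would reject any second flock from the same process, including that case. A minimal illustration of what callers would see, assuming such a workaround were in place (hypothetical behaviour, not current GFS code; the path is the one from the reproduction steps):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/gfs/foo";  /* path used in Steps to Reproduce */
        int fd1 = open(path, O_RDWR | O_CREAT, 0644);
        int fd2 = open(path, O_RDWR);

        if (fd1 < 0 || fd2 < 0) {
                perror("open");
                return 1;
        }

        if (flock(fd1, LOCK_SH) < 0)
                perror("flock fd1");

        /* SH on SH is allowed on ext3, but with the proposed workaround the
           same process flocking the file a second time would get EAGAIN or
           ENOSYS here, so applications would have to tolerate that. */
        if (flock(fd2, LOCK_SH | LOCK_NB) < 0)
                fprintf(stderr, "flock fd2: %s\n", strerror(errno));

        close(fd2);
        close(fd1);
        return 0;
}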
We initially came across this bug trying to work out why the nodes in our live cluster were occasionally rebooting. It turned out that one application had a race condition when handling concurrent requests, which caused it to attempt multiple locks on the same file. The result was kernel panics, which were causing the reboots:

Unable to handle kernel NULL pointer dereference at virtual address 0000000c
 printing eip:
82293ebf
*pde = 00004001
Oops: 0000 [#1]
SMP
Modules linked in: i2c_dev i2c_core lock_dlm(U) gfs(U) lock_harness(U) ext3 jbd dm_cmirror(U) dm_mirror dlm(U) cman(U) bonding(U) md5 ipv6 aoe(U) dm_mod button battery ac uhci_hcd ehci_hcd tg3 sd_mod floppy ata_piix libata scsi_mod
CPU:    0
EIP:    0060:[<82293ebf>]    Not tainted VLI
EFLAGS: 00010293   (2.6.9-67.0.7.ELhugemem)
EIP is at add_to_queue+0x2c/0x27b [gfs]
eax: 78a82030   ebx: 7767141c   ecx: 77671440   edx: 7fdfa524
esi: 00000000   edi: 7fdfa4fc   ebp: 7fdfa4fc   esp: 70770eec
ds: 007b   es: 007b   ss: 0068
Process dod-upgrade-acc (pid: 10054, threadinfo=70770000 task=78a82030)
Stack: 8222d000 7fdfa518 7767141c 8222d000 7fdfa4fc 822941d6 00000000 70904b88
       00000000 00000480 7767141c 822a95a1 7767141c 00000001 70904b88 77671400
       743107ec 704d5380 7fdfa4fc 80688500 70770f90 7836b8e0 021ad19a 70770f58
Call Trace:
 [<822941d6>] gfs_glock_nq+0xc8/0x116 [gfs]
 [<822a95a1>] do_flock+0x111/0x182 [gfs]
 [<021ad19a>] selinux_file_lock+0x7f/0x88
 [<822a9673>] gfs_flock+0x0/0x76 [gfs]
 [<0216e462>] sys_flock+0x96/0x120
Code: 57 56 53 89 c3 51 8b 78 08 8b 87 9c 00 00 00 89 04 24 8b 43 0c 85 c0 0f 84 29 02 00 00 8b 77 28 8d 57 28 39 d6 0f 84 f6 00 00 00 <39> 46 0c 0f 85 e6 00 00 00 f6 43 14 08 75 2d f6 46 14 08 74 27
 <0>Fatal exception: panic in 5 seconds

Even without a full fix, it would be good to find a way to avoid this.
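Until there is a kernel-side fix, one way an application can protect itself is to make sure it never issues a second flock() on a file it already holds, for example by serializing lock attempts and remembering held files by (st_dev, st_ino). A rough sketch of such a guard, purely illustrative (the flock_once() name, the fixed-size table, and the EALREADY convention are all hypothetical, not part of GFS or any existing library):

#include <errno.h>
#include <pthread.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAX_HELD 64

/* Per-process registry of files on which we already hold a flock,
   keyed by (st_dev, st_ino) and guarded by a mutex so concurrent
   request handlers cannot race each other into taking a second flock
   on the same file (which is what trips the GFS assertion/oops). */
static struct {
        dev_t dev;
        ino_t ino;
        int   in_use;
} held[MAX_HELD];
static pthread_mutex_t held_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Take a flock only if this process does not already hold one on the
   file; returns 0 on success, -1 with errno set otherwise.  Callers
   will normally want op to include LOCK_NB so the mutex is not held
   across a blocking lock.  A matching release helper (not shown here)
   would clear the table entry after flock(fd, LOCK_UN). */
int flock_once(int fd, int op)
{
        struct stat st;
        int i, slot = -1, ret = -1;

        if (fstat(fd, &st) < 0)
                return -1;

        pthread_mutex_lock(&held_mutex);
        for (i = 0; i < MAX_HELD; i++) {
                if (held[i].in_use &&
                    held[i].dev == st.st_dev && held[i].ino == st.st_ino) {
                        pthread_mutex_unlock(&held_mutex);
                        errno = EALREADY;       /* we already hold this lock */
                        return -1;
                }
                if (!held[i].in_use && slot < 0)
                        slot = i;
        }

        if (slot < 0) {
                errno = ENOMEM;                 /* table full */
        } else {
                ret = flock(fd, op);
                if (ret == 0) {
                        held[slot].dev = st.st_dev;
                        held[slot].ino = st.st_ino;
                        held[slot].in_use = 1;
                }
        }
        pthread_mutex_unlock(&held_mutex);
        return ret;
}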