Bug 456282 - (CVE-2008-4307) CVE-2008-4307 Kernel BUG() in locks_remove_flock
CVE-2008-4307 Kernel BUG() in locks_remove_flock
Status: CLOSED ERRATA
Product: Security Response
Classification: Other
Component: vulnerability (Show other bugs)
unspecified
All Linux
high Severity high
: ---
: ---
Assigned To: Red Hat Product Security
source=redhat,reported=20080723,impac...
: Security
: 464763 491695 499089 (view as bug list)
Depends On: 456284 456285 456287 456288 456384 463870
Blocks:
  Show dependency treegraph
 
Reported: 2008-07-22 12:32 EDT by Eugene Teo (Security Response)
Modified: 2012-02-13 07:10 EST (History)
41 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-12-23 19:08:10 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
Lock a file with fcntl/POSIX locking. (750 bytes, text/plain)
2008-07-22 16:27 EDT, Bryn M. Reeves
no flags Details
reproducer - lock file in one thread and close in another (1.25 KB, text/plain)
2008-07-22 18:36 EDT, Bryn M. Reeves
no flags Details
Patch to check for fcntl/close race and clean up POSIX locks (1.22 KB, patch)
2008-08-20 12:27 EDT, Bryn M. Reeves
no flags Details | Diff
Backport c4d7c402b788b73dc24f1e54a57f89d3dc5eb7bc to RHEL4 (1.18 KB, patch)
2008-12-04 14:21 EST, Bryn M. Reeves
no flags Details | Diff
revised patch (1.19 KB, patch)
2008-12-11 08:05 EST, Bryn M. Reeves
no flags Details | Diff

  None (edit)
Comment 2 Bryn M. Reeves 2008-07-22 15:55:54 EDT
This is the BUG on RHEL4 somewhat mangled by netconsole - will try to get a
cleaner version:

------------[ cut here ]------------
kernel BUG at fs/locks.c:1798!
invalid operand: 0000 [#1] SMP 
CPU:    0
Modules linked in: nfs lockd nfs_acl md5 ipv6 parport_pc lp parport netconsole
netdump autofs4 sunrpc dm_multipath usb_storage button battery ac uhci_hcd
ehci_hcd hw_random snd_azx snd_hda_codec sn
d_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tg3
qla2xxx scsi_transport_fc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix
libata sd_mod scsi_mod
Process locker (pid: 4908, threadinfo=f6256000 task=f686a8b0)
EIP:    0060:[<c016efc4>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-55.ELsmp) 
EIP is at locks_remove_flock+0xa1/0xe1
eax: f6b3612c   ebx: f6e414b4   ecx: 00000000   edx: 00000081
esi: 00000000   edi: f6e4140c   ebp: f67a7380   esp: f6256f28
ds: 007b   es: 007b   ss: 0068

Stack: 00000007 f67a7380 00000003 00000003 f667f00c 00000286 00000046 f667f00c 
       c01fd679 00000000 00000286 0000001e c01fe418 0000001e 0000001e 00000000 
       b7f57d42 c0202c45 f667f00c 00000001 00000000 00000246 00000000 f67a7380 
Pid: 4908, comm:               locker
EIP: 0060:[<c016efc4>] CPU: 0
EIP is at locks_remove_flock+0xa1/0xe1
 EFLAGS: 00010246    Not tainted  (2.6.9-55.ELsmp)
EAX: f6b3612c EBX: f6e414b4 ECX: 00000000 EDX: 00000081
ESI: 00000000 EDI: f6e4140c EBP: f67a7380 DS: 007b ES: 007b
CR0: 8005003b CR2: b7fa8000 CR3: 36fb6fa0 CR4: 000006f0
Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6
c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 33 ad 2e c0
89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 
e0 
Call Trace:
 [<c015bd82>] __fput+0x41/0x100
 [<c016ab21>] sys_fcntl64+0x79/0x80
 [<c02d5ee3>] syscall_call+0x7/0xb
Comment 3 Bryn M. Reeves 2008-07-22 16:17:20 EDT
This is a problem with how the RHEL4/5 kernels handle POSIX locks. It is
possible for an fcntl call to race with a close on the same file descriptor. A
similar problem was fixed recently but this appears to be a separate issue
(reproduced on 2.6.18-92.1.6.el5).

Acquiring a POSIX lock via fcntl goes like this:

sys_fcntl()
    fget()
    do_fcntl()
        fcntl_setlk()
    fput()
        if(!count) __fput()
            locks_remove_flock()

The fcntl_setlk() call can potentially block for long durations (SETLKW waiting,
slow/distant lockd) allowing another thread in the same process to e.g. close
the file descriptor.

And a close looks like this:

sys_close()
    filp_close()
        locks_remove_posix()
        fput()
            if(!count) __fput()
                locks_remove_flock()

If we close the file descriptor from one thread while another is still blocked
in fcntl_setlk but before the lock has been granted (and before the struct
file_lock is placed on the inode's i_lock list) the call to locks_remove_posix
in the close path will "miss" the POSIX lock. We also fail to call
locks_remove_flock at this point since the thread in fcntl_setlk is still
holding a reference on the file.

This causes the final fput from the sys_fcntl return path to trigger a BUG when
__fput calls locks_remove_flock:

/*
 * This function is called on the last close of an open file.
 */
void locks_remove_flock(struct file *filp)
{
 [...]
        while ((fl = *before) != NULL) {
                if (fl->fl_file == filp) {
                        if (IS_FLOCK(fl)) {
                                locks_delete_lock(before);
                                continue;
                        }
                        if (IS_LEASE(fl)) {
                                lease_modify(before, F_UNLCK);
                                continue;
                        }
                        if (IS_POSIX(fl))
                                continue;
                        /* What? */
                        BUG(); <----
                }
                before = &fl->fl_next;
        }
        unlock_kernel();
}
Comment 4 Bryn M. Reeves 2008-07-22 16:18:48 EDT
Discussing this with Al Viro this appears similar to a problem fixed in our
trees back in May. 2.6.18-92.1.6.el5 should contain that fix however so this
appears to be something different.
Comment 5 Bryn M. Reeves 2008-07-22 16:27:31 EDT
Created attachment 312392 [details]
Lock a file with fcntl/POSIX locking.

Lock a file with fcntl/POSIX locking.
Comment 7 Bryn M. Reeves 2008-07-22 16:35:36 EDT
I can reliably reproduce this with the code in comment #5 and comment #6.

Steps to reproduce:

1. Mount a writable NFS volume from a remote host with nfslock running.
2. Compile the two test programs
gcc -pthread -Wall -o locker locker.c
gcc -Wall -o lockit lockit.c
3. In one terminal start lockit running on a file on the NFS volume:

# lockit /mnt/test

4. In a 2nd terminal start locker running on the same file:

# locker /mnt/test

5. Terminate lockit (CTRL-C)

Actual results:
# ./locker /mnt/locktest
pid 4907 main launching thread
pid 4907 thread 4908 in do_lock
pid 4907 thread 4908  locking
Read from remote host p380-1.gsslab: Connection reset by peer
Connection to p380-1.gsslab closed.

Host died with BUG() in comment #2

Expected results:
No bug.

Additional information:
Should be possible to trigger this with just one program but I found having
something already locking the file and then using SETLKW seemed to make it
easier to trigger.
Comment 8 Bryn M. Reeves 2008-07-22 17:47:01 EDT
Reproducable locally:

Same steps as in comment #7 but the nfs volume is exported locally from the test
machine:

kernel BUG at fs/locks.c:1798!
invalid operand: 0000 [#1]
SMP 
Modules linked in: netconsole netdump loop nfs nfsd exportfs lockd nfs_acl md5
ipv6 parport_pc lp parport autofs4 sunrpc dm_multipath usb_storage button
battery ac uhci_hcd ehci_hcd hw_random snd_azx snd_hda_codec snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc tg3 qla2xxx
scsi_transport_fc dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata
sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c016efc4>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-55.ELsmp) 
EIP is at locks_remove_flock+0xa1/0xe1
eax: f69068ac   ebx: f61294b4   ecx: 00000000   edx: 00000081
esi: 00000000   edi: f612940c   ebp: f6e4d180   esp: f62a2f28
ds: 007b   es: 007b   ss: 0068
Process locker (pid: 5333, threadinfo=f62a2000 task=f6d61230)
Stack: 00000007 f6e4d180 00000003 f62a2f5c c011cd42 00000286 f6f81680 f6f81430 
       f6ed5c80 f7e3d980 f6f8159c f5f9cec4 c02d3d56 f62a2fbc c02d3d86 00000000 
       b7fb5d42 f6f81430 f6f81430 00000001 00000000 00000246 00000000 f6e4d180 
Call Trace:
 [<c011cd42>] recalc_task_prio+0x128/0x133
 [<c02d3d56>] schedule+0x84e/0x8ec
 [<c02d3d86>] schedule+0x87e/0x8ec
 [<c015bd82>] __fput+0x41/0x100
 [<c016ab21>] sys_fcntl64+0x79/0x80
 [<c02d5ee3>] syscall_call+0x7/0xb
Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 9f df ff ff eb 1d f6
c2 20 74 0e ba 02 00 00 00 89 d8 e8 ba ec ff ff eb 0a <0f> 0b 06 07 33 ad 2e c0
89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0 
Comment 9 Bryn M. Reeves 2008-07-22 17:48:32 EDT
Created attachment 312399 [details]
log of BUG in comment #8
Comment 10 Bryn M. Reeves 2008-07-22 18:01:43 EDT
This seems to be the same problem discussed here:

http://lkml.org/lkml/2006/2/17/135

And this was also reported back in 2004:

http://lkml.org/lkml/2004/10/1/116

Chris Wright provided a workaround patch for testing:

http://lkml.org/lkml/2004/10/1/276

The patch adds a check in the return path from sys_fcntl that the current entry
in the file table for our fd matches the file we're holding.

Comment 11 Bryn M. Reeves 2008-07-22 18:05:39 EDT
Created attachment 312401 [details]
Patch from http://lkml.org/lkml/2004/10/1/276
Comment 12 Bryn M. Reeves 2008-07-22 18:29:18 EDT
Discussed this with Al Viro who mentioned:

linux-2.6-fs-race-condition-in-dnotify.patch (and bug 443440 / bug 439759)

Which fixes a similar fcntl/close race (I can't see the bugs but the patch
header describes a reordering bug updating i_flock).

I can't seem to trigger this on 2.6.9-78.EL with the steps in comment #7 but I
think that might be down to the test case. This can be triggered using those
steps on 2.6.18-92.1.6.el5 which does contain this patch.
Comment 13 Bryn M. Reeves 2008-07-22 18:36:03 EDT
Created attachment 312402 [details]
reproducer - lock file in one thread and close in another

Slightly less stupid test case. Still doesn't seem to trigger it on 2.6.9-78.EL
so it may already be fixed there?
Comment 15 Eugene Teo (Security Response) 2008-07-23 02:31:35 EDT
Used gcc-4.1.2-42.el5, and it complained "locker.c:13: error: expected
declaration specifiers or '...' before 'gettid'". Modified locker.c as follows:

 13 /* _syscall0(pid_t,gettid); */
 14 pid_t gettid(void)
 15 {
 16         return syscall(__NR_gettid);
 17 }
Comment 16 Eugene Teo (Security Response) 2008-07-23 03:26:19 EDT
So I tested this on 2.6.18-92.el5, and given the test scenario in comment #7, I
managed to trigger the problem after numerous tries.

intel-s5000phb-01.rhts.bos.redhat.com login: ------------[ cut here ]------------
kernel BUG at fs/locks.c:2054!
invalid opcode: 0000 [#1]
SMP 
last sysfs file:
/devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:07:00.1/irq
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss autofs4
hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_multipath video
sbs backlight i2c_ec button battery asus_acpi ac parport_pc lp parport sg
i5000_edac i2c_i801 ide_cd edac_mc cdrom i2c_core pcspkr e1000e serio_raw
dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    3
EIP:    0060:[<c0482513>]    Not tainted VLI
EFLAGS: 00010246   (2.6.18-92.el5 #1) 
EIP is at locks_remove_flock+0xc1/0xde
eax: f6e5c41c   ebx: d88a89f0   ecx: 00000286   edx: 00000000
esi: f7b44cc0   edi: f7b44cc0   ebp: d88a8950   esp: da0fdf18
ds: 007b   es: 007b   ss: 0068
Process locker2 (pid: 16672, ti=da0fd000 task=c1a1b000 task.ti=da0fd000)
Stack: 00000000 00000000 00000000 00000000 00000000 f7b44cc0 0000411f 00000000 
       00000000 00000000 f7b44cc0 00000212 00000000 00000000 ffffffff 7fffffff 
       00000000 00000000 00000000 00000000 00000000 db1375a0 00100100 00200200 
Call Trace:
 [<c047189b>] __fput+0x88/0x167
 [<c047fdf6>] sys_fcntl64+0x79/0x80
 [<c0404eff>] syscall_call+0x7/0xb
 =======================
Code: 34 39 70 28 75 2d 0f b6 50 2c f6 c2 02 74 09 89 d8 e8 b7 fe ff ff eb 1d 80
e2 20 74 0e ba 02 00 00 00 89 d8 e8 12 ff ff ff eb 0a <0f> 0b 06 08 6d f4 62 c0
89 c3 8b 03 85 c0 75 c6 e8 12 71 18 00 
EIP: [<c0482513>] locks_remove_flock+0xc1/0xde SS:ESP 0068:da0fdf18
 <0>Kernel panic - not syncing: Fatal exception
Comment 17 Eugene Teo (Security Response) 2008-07-23 03:54:31 EDT
Bryn, I managed to trigger the problem on rhel-4 with kernel-2.6.9-78.ELsmp, so
it's not fixed. I am going to redo the test with the latest kernel to make sure.

dell-pesc420-01.rhts.bos.redhat.com login: ------------[ cut here ]------------
kernel BUG at fs/locks.c:1823!
invalid operand: 0000 [#1]
SMP 
Modules linked in: nfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4
sunrpc cpufreq_powersave loop button battery ac uhci_hcd ehci_hcd tg3 floppy
dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod ata_piix libata sd_mod scsi_mod
CPU:    0
EIP:    0060:[<c0170a30>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-78.ELsmp) 
EIP is at locks_remove_flock+0xa1/0xe1
eax: c2def78c   ebx: f6320214   ecx: 00000000   edx: 00000081
esi: 00000000   edi: f632016c   ebp: f6807a80   esp: d1253f24
ds: 007b   es: 007b   ss: 0068
Process locker2 (pid: 21124, threadinfo=d1253000 task=f71ce630)
Stack: f632016c 00000007 f6807a80 00000003 00000003 f426700c 00000286 00000046 
       f426700c c02014f4 00000000 00000286 00000020 c0202299 00000020 00000020 
       00000000 b7fb7d44 c0206b2f f426700c 00000001 00000246 00000000 f6807a80 
Call Trace:
 [<c02014f4>] tty_ldisc_deref+0x50/0x5f
 [<c0202299>] tty_write+0x258/0x262
 [<c0206b2f>] write_chan+0x0/0x1d0
 [<c015d39a>] __fput+0x45/0x11a
 [<c016c4ed>] sys_fcntl64+0x79/0x80
 [<c02e09db>] syscall_call+0x7/0xb
Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 ff de ff ff eb 1d f6
c2 20 74 0e ba 02 00 00 00 89 d8 e8 72 ec ff ff eb 0a <0f> 0b 1f 07 ec 5a 2f c0
89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0 
 <0>Fatal exception: panic in 5 seconds
Kernel panic - not syncing: Fatal exception
Comment 23 Bryn M. Reeves 2008-07-23 12:52:02 EDT
Eugene mentioned the revised test case was also failing on git HEAD with an
EBADF from locker. I guess I'll have to add real synchronisation between the
threads...  (it's hitting the close before the fcntl).
Comment 27 Bryn M. Reeves 2008-08-20 12:27:17 EDT
Created attachment 314636 [details]
Patch to check for fcntl/close race and clean up POSIX locks

Chris's original patch from LKML ported to RHEL4 & fcheck.
Comment 28 Bryn M. Reeves 2008-08-20 12:29:40 EDT
Al, the patch in comment #27 is still not pretty but it does seem like it should stop the BUG(). Any feedback on this? I see some not disimilar checks in related code but I don't have the experience with this part of the kernel to say if this is viable or not.
Comment 33 Alexander Viro 2008-08-27 03:34:56 EDT
I don't believe that it's going to help.  That patch is equivalent
to what's *already* done in locks.c.  Do we have it reproduced on
mainline kernel?
Comment 37 Bryn M. Reeves 2008-09-15 07:24:25 EDT
We seem to have some positive testing results with the patch in comment #27.
Comment 38 Bryn M. Reeves 2008-09-16 13:15:14 EDT
Eugene - I was about to post this to LKML but I realised we don't actually have testing on current upstream (my last test was from git about a month ago & I've long thrown that kernel away).

I have a build from last night's git ready to go on x86_64 but you/Harald seem to be the ones who can trigger this.. I'll give it some testing here and see if I can get it to go but failing that would be great if you or Harald can try it.

Harald's latest results are for 78.0.1.EL vs. 78.0.1.bz456282.EL (just the patch from comment #27). He can reliably trigger the BUG on the former but has been unable to trigger it on 78.0.1.bz456282.EL. I'm looking into a console message he reported with only the patched kernel now:

    nlmclnt_lock: VFS is out of sync with lock manager!

Will update with further information when I have it.
Comment 48 Bryn M. Reeves 2008-10-24 07:22:40 EDT
Output from a kernel with the lockdump patch applied:

Oct 24 10:56:23  [...network console startup...]
Oct 24 10:56:44 192.168.122.67  fs/locks.c: invalid struct file_lock at c1950530
Oct 24 10:56:44 192.168.122.67  {   
Oct 24 10:56:44 192.168.122.67               fl_next 00000000
Oct 24 10:56:44 192.168.122.67               fl_link (next=c19503e4, prev=c037358c)
Oct 24 10:56:44 192.168.122.67              fl_block (next=c195053c, prev=c195053c)
Oct 24 10:56:44 192.168.122.67              fl_owner e5e01900
Oct 24 10:56:44 192.168.122.67                fl_pid 6232
Oct 24 10:56:44 192.168.122.67               fl_file e5b7e340
Oct 24 10:56:44 192.168.122.67              fl_flags 0x81
Oct 24 10:56:44 192.168.122.67               fl_type F_WRLCK
Oct 24 10:56:44 192.168.122.67              fl_start 0
Oct 24 10:56:44 192.168.122.67                fl_end 0
Oct 24 10:56:44 192.168.122.67             fl_fasync 00000000
Oct 24 10:56:44 192.168.122.67         fl_break_time 0
Oct 24 10:56:44 192.168.122.67                fl_ops f8b12a8c []
Oct 24 10:56:44 192.168.122.67              fl_lmops 00000000 []
Oct 24 10:56:44 192.168.122.67           fl_u.nfs_fl { state = 0, flags = 0x0 (), owner = dcd72ee0 }
Oct 24 10:56:44 192.168.122.67  }
Oct 24 10:56:44  ------------[ cut here ]------------
Oct 24 10:56:44  kernel BUG at fs/locks.c:1871!
Oct 24 10:56:44  invalid operand: 0000 [#1]
Oct 24 10:56:44  Modules linked in: nfs lockd nfs_acl netconsole parport_pc lp parport autofs4 sunrpc ib_srp scsi_mod ib_sdp ib_ipoib md5 ipv6 rdma_ucm rdma_cm ib_local_sa iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core loop button battery ac uhci_hcd 8139cp mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
Oct 24 10:56:44  CPU:    0
Oct 24 10:56:44  EIP:    0060:[<c018653b>]    Not tainted VLI
Oct 24 10:56:44  EFLAGS: 00010292   (2.6.9-67.0.15.0.bz456282.bmr0.EL)
Oct 24 10:56:44  EIP is at locks_remove_flock+0x11e/0x1b9
Oct 24 10:56:44  eax: 0000004b   ebx: c1950530   ecx: c0333136   edx: dae9cee0
Oct 24 10:56:45  esi: daed2bf8   edi: daed2b20   ebp: e5b7e340   esp: dae9cf18
Oct 24 10:56:45  ds: 007b   es: 007b   ss: 0068
Oct 24 10:56:45  Process locker (pid: 6233, threadinfo=dae9c000 task=dae9a2b0)
Oct 24 10:56:45  Stack: 00000003 fffffe00 c0185b4b daed2b20 00000007 e5b7e340 00000003 00000000
Oct 24 10:56:45 192.168.122.67        00000046 f631100c 00000286 00000000 c022d46d 00000286 0000001e f6311000
Oct 24 10:56:45 192.168.122.67        c022e843 0000001e b7f5cd42 c0234c2a f631100c 0000001e b7f5cd24 00000001
Oct 24 10:56:45  Call Trace:
Oct 24 10:56:45 192.168.122.67  [<c0185b4b>] fcntl_setlk+0x3e2/0x3ec
Oct 24 10:56:45 192.168.122.67  [<c022d46d>] tty_ldisc_deref+0xc8/0x139
Oct 24 10:56:45 192.168.122.67  [<c022e843>] tty_write+0x381/0x38b
Oct 24 10:56:45 192.168.122.67  [<c0234c2a>] write_chan+0x0/0x1d5
Oct 24 10:56:45 192.168.122.67  [<c016e066>] __fput+0x41/0xee
Oct 24 10:56:45 192.168.122.67  [<c018145e>] sys_fcntl64+0x79/0x80
Oct 24 10:56:45 192.168.122.67  [<c031e103>] syscall_call+0x7/0xb
Oct 24 10:56:45  Code: 32 0f b6 43 40 a8 02 74 09 89 f0 e8 02 d5 ff ff eb 23 a8 20 74 0e ba 02 00 00 00 89 f0 e8 b0 e5 ff ff eb 11 89 d8 e8 a7 fd ff ff <0f> 0b 4f 07 fa 2e 33 c0 89 de 8b 1e eb bf b8 00 f0 ff ff 21 e0
Oct 24 10:56:45 192.168.122.67  <0>Fatal exception: panic in 5 seconds
Oct 24 10:56:49  Kernel panic - not syncing: Fatal exception
Comment 49 Bryn M. Reeves 2008-10-24 07:35:53 EDT
fl_flags 0x81 == FL_POSIX | FL_SLEEP
Comment 50 Bryn M. Reeves 2008-10-24 11:48:34 EDT
Another crash with the debug kernel. This time I added calls to dump_struct_lock_info() inside locks_remove_posix as well. The lock that we end up BUG()ging on was not removed by locks_remove_posix (despite being having fl_flags == 0x81):

Oct 24 16:58:14  locks_remove_posix: NOT removing lock at c194ebc0
Oct 24 16:58:14  fs/locks.c: dumping struct file_lock at c194ebc0
Oct 24 16:58:14  {
Oct 24 16:58:14               fl_next 00000000
Oct 24 16:58:14               fl_link (next=c194e224, prev=c037458c)
Oct 24 16:58:14              fl_block (next=c194ebcc, prev=c194ebcc)
Oct 24 16:58:15              fl_owner e6700200
Oct 24 16:58:15                fl_pid 3928
Oct 24 16:58:15               fl_file e4ae9400
Oct 24 16:58:15              fl_flags 0x81
Oct 24 16:58:15               fl_type F_RDLCK
Oct 24 16:58:15              fl_start 4
Oct 24 16:58:15                fl_end 4
Oct 24 16:58:15             fl_fasync 00000000
Oct 24 16:58:15         fl_break_time 0
Oct 24 16:58:15                fl_ops 00000000 []
Oct 24 16:58:15              fl_lmops 00000000 []
Oct 24 16:58:15           fl_u.nfs_fl { state = 0, flags = 0x0 (n/a), owner = 00000000 }
Oct 24 16:58:15  }
[...]
Oct 24 16:59:06  fs/locks.c:locks_remove_flock() invalid file_lock
Oct 24 16:59:06  fs/locks.c: dumping struct file_lock at c194ebc0
Oct 24 16:59:06  {
Oct 24 16:59:06               fl_next 00000000
Oct 24 16:59:06               fl_link (next=c194e224, prev=c037458c)
Oct 24 16:59:06              fl_block (next=c194ebcc, prev=c194ebcc)
Oct 24 16:59:06              fl_owner e6700040
Oct 24 16:59:06                fl_pid 4647
Oct 24 16:59:06               fl_file e40623c0
Oct 24 16:59:06              fl_flags 0x81
Oct 24 16:59:06               fl_type F_WRLCK
Oct 24 16:59:06              fl_start 0
Oct 24 16:59:06                fl_end 0
Oct 24 16:59:06             fl_fasync 00000000
Oct 24 16:59:06         fl_break_time 0
Oct 24 16:59:06                fl_ops f8b12a8c []
Oct 24 16:59:06              fl_lmops 00000000 []
Oct 24 16:59:06           fl_u.nfs_fl { state = 0, flags = 0x0 (n/a), owner = e455c4e0 }
Oct 24 16:59:06  }
Oct 24 16:59:06  ------------[ cut here ]------------
Oct 24 16:59:06  kernel BUG at fs/locks.c:1876!
Oct 24 16:59:06  invalid operand: 0000 [#1]
Oct 24 16:59:06  Modules linked in: nfs lockd nfs_acl netconsole parport_pc lp parport autofs4 sunrpc ib_srp scsi_mod ib_sdp ib_ipoib md5 ipv6 rdma_ucm rdma_cm ib_local_sa iw_cm ib_addr ib_umad ib_ucm ib_uverbs ib_cm ib_sa ib_mad ib_core loop button battery ac uhci_hcd 8139cp mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
Oct 24 16:59:07  CPU:    0
Oct 24 16:59:07  EIP:    0060:[<c018656b>]    Not tainted VLI
Oct 24 16:59:07  EFLAGS: 00010286   (2.6.9-67.0.15.0.bz456282.bmr1.EL)
Oct 24 16:59:07  EIP is at locks_remove_flock+0x128/0x1c4
Oct 24 16:59:07  eax: 0000004f   ebx: c194ebc0   ecx: c033315a   edx: dae2fedc
Oct 24 16:59:07  esi: db5f15a0   edi: db5f14c8   ebp: e40623c0   esp: dae2ff14
Oct 24 16:59:07  ds: 007b   es: 007b   ss: 0068
Oct 24 16:59:07  Process locker (pid: 4653, threadinfo=dae2f000 task=dae8d1a0)
Oct 24 16:59:07  Stack: c03331ff 00000003 fffffe00 c0185b4b db5f14c8 00000007 e40623c0 00000003
Oct 24 16:59:07        00000000 00000046 e898000c 00000286 00000000 c022d49d 00000286 0000001e
Oct 24 16:59:07        e8980000 c022e873 eab64e10 dae2ff88 c011f43a e898000c 0000001e b7f03d24
Oct 24 16:59:07  Call Trace:
Oct 24 16:59:07  [<c0185b4b>] fcntl_setlk+0x3e2/0x3ec
Oct 24 16:59:07  [<c022d49d>] tty_ldisc_deref+0xc8/0x139
Oct 24 16:59:07  [<c022e873>] tty_write+0x381/0x38b
Oct 24 16:59:07  [<c011f43a>] recalc_task_prio+0x128/0x133
Oct 24 16:59:07  [<c016e066>] __fput+0x41/0xee
Oct 24 16:59:07  [<c018145e>] sys_fcntl64+0x79/0x80
Oct 24 16:59:07  [<c031e133>] syscall_call+0x7/0xb
Oct 24 16:59:07  Code: f0 e8 dc d4 ff ff eb 2e a8 20 74 0e ba 02 00 00 00 89 f0 e8 8a e5 ff ff eb 1c 68 ff 31 33 c0 e8 a9 e3 f9 ff 89 d8 e8 6d fb ff ff <0f> 0b 54 07 19 2f 33 c0 59 89 de 8b 1e eb b4 b8 00 f0 ff ff 21
Oct 24 16:59:07  <0>Fatal exception: panic in 5 seconds
Oct 24 16:59:11  Kernel panic - not syncing: Fatal exception

Since we know that IS_POSIX() for this lock will return true, this must have been because posix_same_owner(fl, &lock) for the two locks returned false:

        while (*before != NULL) {
                struct file_lock *fl = *before;
                if (IS_POSIX(fl) && posix_same_owner(fl, &lock)) {
                        printk(KERN_WARNING "locks_remove_posix: removing lock at %p", fl);
                        dump_struct_file_lock(fl);
                        locks_delete_lock(before);
                        continue;
                }
                printk(KERN_WARNING "locks_remove_posix: NOT removing lock at %p", fl);
                dump_struct_file_lock(fl);
                before = &fl->fl_next;
        }

Just trying to get a vmcore from one of these crashes so I can take a look at the state of the lock list in more detail.
Comment 51 Bryn M. Reeves 2008-10-24 12:06:08 EDT
So, I don't really understand this code, so please forgive my ignorance but I don't understand how this block of code can skip cleaning up some POSIX locks:

                struct file_lock *fl = *before;
                if (IS_POSIX(fl) && posix_same_owner(fl, &lock)) {
                        printk(KERN_WARNING "locks_remove_posix: removing lock
at %p", fl);
                        dump_struct_file_lock(fl);
                        locks_delete_lock(before);
                        continue;
                }

If we're going to not clean up these locks due to the posix_same_owner() check then we *know* that we will bug on one of those locks when this fd is shut down by the final call to fput().

I guess I don't understand when it would be valid to be returning from sys_close() -> locks_remove_posix() leaving FL_POSIX locks remaining on the list.
Comment 52 Bryn M. Reeves 2008-10-24 12:14:17 EDT
Because it also gets called from nfs4state.c:release_stateid() I guess
Comment 59 Harald Klein 2008-10-30 12:04:35 EDT
I've just tried to crash 78.0.1.0.bz456282 (see comment 38) again. No chance.

best regards,
Hari
Comment 60 Eugene Teo (Security Response) 2008-10-30 21:30:17 EDT
(In reply to comment #59)
> I've just tried to crash 78.0.1.0.bz456282 (see comment 38) again. No chance.

Thanks Hari. Did you try to trigger it on the latest upstream kernel?

Eugene
Comment 61 Harald Klein 2008-10-31 10:49:41 EDT
(In reply to comment #60)
just tested with 2.6.27.4 on 32bit. No crash, but I noticed those messages:

mount used greatest stack depth: 1824 bytes left
locker used greatest stack depth: 1788 bytes left
locker used greatest stack depth: 1732 bytes left
locker used greatest stack depth: 1660 bytes left

Hari
Comment 62 Eric Sandeen 2008-10-31 12:45:25 EDT
Those messages are generally harmless; each time a thread hits a new high-water (low-water?) mark, it prints this message.  1k+ is still a bit of room....    (though if this was on x86_64, with 8k of stack, that may be worrisome for x86, with only 4k)
Comment 63 Harald Klein 2008-10-31 12:58:10 EDT
test in comment #61 was done on a 32bit KVM instance.
Comment 65 Bryn M. Reeves 2008-12-04 11:09:19 EST
*** Bug 464763 has been marked as a duplicate of this bug. ***
Comment 66 Bryn M. Reeves 2008-12-04 11:20:31 EST
Copying over Uli's problem description from bug 464763:

Comment #1 From  Ulrich Obergfell (uobergfe@redhat.com)  2008-09-30 07:52:28

problem description
-------------------


Kernel panics reproducibly with the following footprint:


kernel BUG at fs/locks.c:1823!
invalid operand: 0000 [#1]
SMP
Modules linked in: nfs nfsd exportfs
lockd nfs_acl parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc loop button 
battery ac md5 ipv6 pcnet32 mii floppy dm_snapshot dm_zero dm_mirror ext3 jbd
dm
_mod ata_piix libata mptscsih mptsas mptspi mptscsi mptbase sd_mod scsi_mod
CPU:    1
EIP:    0060:[<c0170a30>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-78.ELsmp)
EIP is at locks_remove_flock+0xa1/0xe1
eax: cf029c0c   ebx: ca64936c   ecx: 00000000   edx: 00000001
esi: 00000000   edi: ced6940c   ebp: c6580ec0   esp: ce107f24
ds: 007b   es: 007b   ss: 0068
Process flocktest (pid: 15982, threadinfo=ce107000 task=c63ef6b0)
Stack: ced6940c 00000006 c6580ec0
00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000246 00000000 c6580ec0
Call Trace:
 [<c015d39a>] __fput+0x45/0x11a
 [<c016c4ed>] sys_fcntl64+0x79/0x80
 [<c02e09db>] syscall_call+0x7/0xb
Code: 38 39 68 2c 75 2d 0f b6 50 30 f6 c2 02 74 09 89 d8 e8 ff de ff ff eb 1d
f6
 c2 20 74 0e ba 02 00 00 00 89 d8 e8 72 ec ff ff eb 0a <0f> 0b 1f 07 d9 5a 2f
c0
 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
 <0>Fatal exception: panic in 5 seconds


The multithreaded "flocktest" process opens files via NFS and uses the fcntl()
system call to lock ranges of bytes in those files. The flow of execution thru
the fcntl() system call that is suspected to cause the problem is as follows:


sys_fcntl
  filp = fget(fd)
           increment f_count

  do_fcntl
  {
    switch (cmd)
    case F_SETLK:
         fcntl_setlk
         {
          again:

           // filp->f_op->lock
           nfs_lock
           {
             do_setlk
             {
               // NFS_PROTO(inode)->lock
               nfs3_proc_lock
                 nlmclnt_proc - nlmclnt_lock - nlmclnt_call - rpc_call_sync

               if ("NFS_PROTO(inode)->lock" returned -EINTR)
                  posix_lock_file_wait
                    Create a local FL_POSIX lock. ...........................
                                                                            :
               return (status from "NFS_PROTO(inode)->lock")                :
             }                                                              :
           }                                                                :
                                                                            :
           // Attempt to detect a close/fcntl race and recover by releasing :
           // the lock that was just acquired.                              :
                                                                            :
           if ("do_setlk/nfs_lock" returned zero && file has been closed)   :
              set l_type = F_UNLCK                                          :
              goto again                                                    :
         }                                                                  :
  }                                                                         :
                                                                            :
  fput(filp)                                                                :
    decrement f_count                                                       :
    if (f_count is zero)                                                    :
       __fput                                                               :
         locks_remove_flock                                                 :
           BUG() because the local FL_POSIX lock created by do_setlk() .....:
                 still exists.


One thread in the process executes the fcntl() system call. sys_fcntl() calls
fget() which increments the reference count "f_count" of the file structure.
Let's assume that "f_count" is now equal two. The thread subsequently issues
an RPC call in nfs3_proc_lock() and gets blocked while waiting for a response
from the NFS server.

Another thread in the process closes the file. filp_close() calls fput() which
decrements "f_count" but does not call __fput() because the reference count is
still greater than zero (there is a reference from sys_fcntl()).

The thread that is waiting for a response from the NFS server is interrupted
by a signal, so "NFS_PROTO(inode)->lock" returns -EINTR and do_setlk() calls
posix_lock_file_wait() in order to create a local FL_POSIX lock. fcntl_setlk()
executes the following code after "filp->f_op->lock" returns:


        spin_lock(&current->files->file_lock);
        f = fcheck(fd);
        spin_unlock(&current->files->file_lock);
        if (!error &&
             ^^^^^ is -EINTR
            cmd != F_UNLCK && f != filp && flock.l_type != F_UNLCK) {
                flock.l_type = F_UNLCK;
                goto again;
        }


The fragment of code from fcntl_setlk() attempts to detect a close/fcntl race
and recover by releasing the lock that was just acquired. However, since error
is -EINTR in the scenario outlined above, the body of the "if" statement is not
executed and thus the local FL_POSIX lock created by do_setlk() is not
released.

When sys_fcntl() subsequently calls fput() "f_count" goes down to zero (no more
references). locks_remove_flock() BUG's because the local FL_POSIX lock created
by do_setlk() still exists.
Comment 67 Bryn M. Reeves 2008-12-04 11:22:00 EST
Comment #2 From  Ulrich Obergfell (uobergfe@redhat.com)  2008-09-30 07:57:43 EDT

The problem exists in kernel 2.6.9-78 (RHEL 4.7) and in 2.6.18-53 (RHEL 5.1).
Panic is reproducible on a lab machine with an instrumented 2.6.18-53 kernel.


Please find below:

1. description of instrumentation
2. description of method used to reproduce
3. details from a vmcore
4. source code of test program "t.c"



1. instrumentation
   ---------------


The instrumentation consists of a printk() in do_setlk() and three printk()'s
in fcntl_setlk() in order to trace the flow of execution through the suspected
code.


-  code fragment from fs/nfs/file.c:

static int do_setlk(struct file *filp, int cmd, struct file_lock *fl)
{
        ...
        if (!(NFS_SERVER(inode)->flags & NFS_MOUNT_NONLM)) {
                status = NFS_PROTO(inode)->lock(filp, cmd, fl);
                ...
                if (status == -EINTR || status == -ERESTARTSYS) {
                        printk("do_setlk: status=%d\n", status); <-------- (1)
                        do_vfs_lock(filp, fl);
                }
        } else
                status = do_vfs_lock(filp, fl);
        ...
        if (status < 0)
                goto out;
        ...
out:
        return status;
}


-  code fragment from fs/locks.c:

int fcntl_setlk(unsigned int fd, struct file *filp, unsigned int cmd,
                struct flock __user *l)
{
        ...
        if (filp->f_op && filp->f_op->lock != NULL) {
                error = filp->f_op->lock(filp, cmd, file_lock);
                if (error)
                        printk("fcntl_setlk: error=%d\n", error); <------- (2)
        }
        else {
                ...
        }
        ...
        if ( error && fcheck(fd) != filp && flock.l_type != F_UNLCK)
                printk("fcntl_setlk: *no* 'goto again'\n"); <------------- (3)

        if (!error && fcheck(fd) != filp && flock.l_type != F_UNLCK) {
                flock.l_type = F_UNLCK;
                printk("fcntl_setlk: 'goto again'\n");
                goto again;
        }       ^^^^^^^^^^
        ...    not executed
}



2. method
   ------


Note :

The messages from do_setlk() and fcntl_setlk() indicate that the printk()'s at
(1), (2), and (3) were actually executed. The error / status was -4 == -EINTR.
The body of the "if" statement that contains the "goto again" was not executed.


# echo "test" > /tmp/testfile

# mount -o intr,hard localhost:/tmp /mnt

# ./t /mnt/testfile

Shutting down NFS mountd: [  OK  ]
Shutting down NFS daemon:
nfsd: last server has exited
nfsd: unexporting all filesystems
[  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]

do_setlk: status=-4
fcntl_setlk: error=-4
fcntl_setlk: *no* 'goto again'

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/locks.c:1999
invalid opcode: 0000 [1] SMP 
last sysfs file: /kernel/kexec_crash_loaded
CPU 6 
Modules linked in: nfs(U) fscache(U) nfsd(U) exportfs(U) lockd(U) nfs_acl(U)
aut
ofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) bonding(U) ipv6(U)
dm_
multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) i2c_core(U) button(U)
batter
y(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U)
joydev
(U) shpchp(U) pcspkr(U) bnx2(U) ide_cd(U) cdrom(U) sg(U) dm_snapshot(U)
dm_zero(
U) dm_mirror(U) dm_mod(U) ata_piix(U) libata(U) megaraid_sas(U) sd_mod(U)
scsi_m
od(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 4618, comm: t Not tainted 2.6.18-53.el5_UliO_Test #9
RIP: 0010:[<ffffffff80027074>]  [<ffffffff80027074>]
locks_remove_flock+0xe4/0x122
RSP: 0018:ffff8109a6905e48  EFLAGS: 00010246
RAX: ffff8109a6815298 RBX: ffff8109a639b2f8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000286 RDI: ffffffff802fee80
RBP: ffff8109bcd89380 R08: ffff8109b2824580 R09: 0000000000000020
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8109a639b1f8
R13: ffff8109a639b1f8 R14: ffff8101222ae480 R15: ffff8109a6ca4a98
FS:  0000000040a00940(0063) GS:ffff8109bfc44b40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000360c66c0c7 CR3: 00000009af07b000 CR4: 00000000000006e0
Process t (pid: 4618, threadinfo ffff8109a6904000, task ffff8109bcf23080)
Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000000 ffff8109bcd89380 00000000000011e1 0000000000000000
 0000000000000000 0000000000000000 ffff8109bcd89380 0000000000000212
Call Trace:
 [<ffffffff80012302>] __fput+0x94/0x198
 [<ffffffff8002e2d2>] sys_fcntl+0x2d0/0x2dc
 [<ffffffff8005c28d>] tracesys+0xd5/0xe0

Code: 0f 0b 68 aa ee 28 80 c2 cf 07 48 89 c3 48 8b 03 48 85 c0 75 
RIP  [<ffffffff80027074>] locks_remove_flock+0xe4/0x122
 RSP <ffff8109a6905e48>



3. details from a vmore
   --------------------


   1963 void locks_remove_flock(struct file *filp)
   1964 {
     :
   1985         lock_kernel();
   1986         before = &inode->i_flock;
   1987 
   1988         while ((fl = *before) != NULL) {
   1989                 if (fl->fl_file == filp) {
   1990                         if (IS_FLOCK(fl)) {
   1991                                 locks_delete_lock(before);
   1992                                 continue;
   1993                         }
   1994                         if (IS_LEASE(fl)) {
   1995                                 lease_modify(before, F_UNLCK);
   1996                                 continue;
   1997                         }
   1998                         /* What? */
-> 1999 <-                      BUG();
   2000                 }
   2001                 before = &fl->fl_next;
   2002         }
   2003         unlock_kernel();
   2004 }


crash> sys
      KERNEL: vmlinux
    DUMPFILE: vmcore
        CPUS: 16
        DATE: Mon Sep 29 12:00:29 2008
      UPTIME: 00:04:45
LOAD AVERAGE: 0.40, 0.38, 0.18
       TASKS: 249
    NODENAME: ibm-x3850m2-1
     RELEASE: 2.6.18-53.el5_UliO_Test
     VERSION: #9 SMP Mon Sep 29 11:46:37 EDT 2008
     MACHINE: x86_64  (2932 Mhz)
      MEMORY: 37.5 GB
       PANIC: ""


crash> log | grep _setlk
do_setlk: status=-4
fcntl_setlk: error=-4
fcntl_setlk: *no* 'goto again'

crash> log | grep BUG
Kernel BUG at fs/locks.c:1999


crash> bt
PID: 4618   TASK: ffff8109bcf23080  CPU: 6   COMMAND: "t"
 #0 [ffff8109a6905ca0] die at ffffffff8006a681
 #1 [ffff8109a6905cd0] do_invalid_op at ffffffff8006ac37
 #2 [ffff8109a6905d90] error_exit at ffffffff8005cde9
    [exception RIP: locks_remove_flock+228]
    RIP: ffffffff80027074  RSP: ffff8109a6905e48  RFLAGS: 00010246
    RAX: ffff8109a6815298  RBX: ffff8109a639b2f8  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000286  RDI: ffffffff802fee80
    RBP: ffff8109bcd89380   R8: ffff8109b2824580   R9: 0000000000000020
    R10: 0000000000000000  R11: 0000000000000000  R12: ffff8109a639b1f8
    R13: ffff8109a639b1f8  R14: ffff8101222ae480  R15: ffff8109a6ca4a98
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #3 [ffff8109a6905e40] locks_remove_flock at ffffffff80027045
 #4 [ffff8109a6905f10] __fput at ffffffff80012302
 #5 [ffff8109a6905f50] sys_fcntl at ffffffff8002e2d2
 #6 [ffff8109a6905f80] tracesys at ffffffff8005c28d (via system_call)
    RIP: 000000360d20cddf  RSP: 0000000040a00010  RFLAGS: 00000297
    RAX: ffffffffffffffda  RBX: ffffffff8005c28d  RCX: ffffffffffffffff
    RDX: 0000000040a00110  RSI: 0000000000000006  RDI: 0000000000000003
    RBP: 0000000040a01000   R8: 0000000040a00940   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000297  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000003  R15: 0000000040a00110
    ORIG_RAX: 0000000000000048  CS: 0033  SS: 002b


crash> px ((struct task_struct *)0xffff8109bcf23080)->thread_info.flags
$1 = 0x84
        ^ TIF_SIGPENDING is set

crash> px ((struct task_struct *)0xffff8109bcf23080)->pending.signal.sig
$2 = {0x100}
        ^ SIGKILL is pending


crash> dis 0xffffffff800122fa 3                 // rdi = rbp (file)
0xffffffff800122fa <__fput+140>:                mov    %rbp,%rdi
                                                // locks_remove_flock()
0xffffffff800122fd <__fput+143>:                callq  0xffffffff80026f90
0xffffffff80012302 <__fput+148>:                mov    0x20(%rbp),%rax

crash> dis locks_remove_flock 2
0xffffffff80026f90 <locks_remove_flock>:        push   %r12
                                                // save rbp (file)
0xffffffff80026f92 <locks_remove_flock+2>:      push   %rbp


crash> bt -f
PID: 4618   TASK: ffff8109bcf23080  CPU: 6   COMMAND: "t"
...
 #3 [ffff8109a6905e40] locks_remove_flock at ffffffff80027045
    ffff8109a6905e48: 0000000000000000 0000000000000000
...                                    saved rbp (file)
    ffff8109a6905ef8: 0000000000000010 ffff8109bcd89380 
    ffff8109a6905f08: ffff8109a639b1f8 ffffffff80012302 
...


crash> px *((struct file *)0xffff8109bcd89380)->f_dentry->d_inode->i_flock
$3 = {                               ^
  fl_next = 0x0,                     |
...                                  |
  fl_pid = 0x11e1 = 4577,            |
...                                  |
  fl_file = 0xffff8109bcd89380, -----+
  fl_flags = 0x1, 
               ^ FL_POSIX lock exists for "testfile"
  fl_type = 0x0, 
  fl_start = 0x0, 
  fl_end = 0x0, 
...
}


crash> ps -g 4577
PID: 4577   TASK: ffff8109bac7c080  CPU: 6   COMMAND: "t"
  PID: 4618   TASK: ffff8109bcf23080  CPU: 6   COMMAND: "t"


crash> bt 4577
PID: 4577   TASK: ffff8109bac7c080  CPU: 6   COMMAND: "t"
 #0 [ffff8109a6ca7e38] schedule at ffffffff80061f29
 #1 [ffff8109a6ca7f20] do_exit at ffffffff80015579
 #2 [ffff8109a6ca7f80] tracesys at ffffffff8005c28d (via system_call)
    RIP: 000000360c69585f  RSP: 00007fff9bee28b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005c28d  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 000000000000003c  RDI: 0000000000000000
    RBP: 000000360c946878   R8: 00000000000000e7   R9: ffffffffffffffb0
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 000000360c946878  R14: ffffffff80046f0e  R15: 00007fff9bee29b0
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b



4. source code of test program
   ---------------------------


Note: compile with "cc t.c -lpthread -o t"


#include <sys/stat.h>
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>

int fd;

pthread_t other_thread;

void *other_thread_routine(void *dummy_arg)
{
    struct flock fl;

    fl.l_type   = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start  = 0;
    fl.l_len    = 1;

    if (fcntl(fd, F_SETLK, &fl) == -1) {
        perror("fcntl lock");
        goto out;
    }

    fl.l_type   = F_UNLCK;

    if (fcntl(fd, F_SETLK, &fl) == -1)
        perror("fcntl unlock");
out:
    return (void *)0;
}

main(int argc, char *argv[])
{
    pid_t pid;
    int i, ret;

    if ((fd = open(argv[1], O_RDONLY)) == -1) {
        perror("open");
        exit(1);
    }

    system("service nfs stop");

    ret = pthread_create(&other_thread,
                         (pthread_attr_t *)0,
                         other_thread_routine,
                         (void *)0);
    if (ret) {
        perror("pthread_create");
        exit(1);
    }

    sleep(1);

    if (close(fd) == -1) {
        perror("close");
        exit(1);
    }
}
Comment 68 Bryn M. Reeves 2008-12-04 11:37:15 EST
commit c4d7c402b788b73dc24f1e54a57f89d3dc5eb7bc
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Apr 1 20:26:52 2008 -0400

    NFS: Remove the buggy lock-if-signalled case from do_setlk()
    
    Both NLM and NFSv4 should be able to clean up adequately in the case where
    the user interrupts the RPC call...
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 10e8b80..742cb74 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -566,17 +566,9 @@ static int do_setlk(struct file *filp, int cmd, struct file_lock *fl)
 
        lock_kernel();
        /* Use local locking if mounted with "-onolock" */
-       if (!(NFS_SERVER(inode)->flags & NFS_MOUNT_NONLM)) {
+       if (!(NFS_SERVER(inode)->flags & NFS_MOUNT_NONLM))
                status = NFS_PROTO(inode)->lock(filp, cmd, fl);
-               /* If we were signalled we still need to ensure that
-                * we clean up any state on the server. We therefore
-                * record the lock call as having succeeded in order to
-                * ensure that locks_remove_posix() cleans it out when
-                * the process exits.
-                */
-               if (status == -EINTR || status == -ERESTARTSYS)
-                       do_vfs_lock(filp, fl);
-       } else
+       else
                status = do_vfs_lock(filp, fl);
        unlock_kernel();
        if (status < 0)
Comment 69 Bryn M. Reeves 2008-12-04 14:21:54 EST
Created attachment 325729 [details]
Backport c4d7c402b788b73dc24f1e54a57f89d3dc5eb7bc to RHEL4

Backport Trond's patch to remove bogus signal-detection check from do_setlk(). This check is useless (the locking module should be able to clean up properly after being interrupted by a signal) and causes problems since it can leave a half-baked POSIX lock lying around for locks_remove_flock to trip over in the return path from sys_fcntl (after racing with close()).

The original patch uses do_vfs_lock() which is an NFS helper routine not present in RHEL4's 2.6.9 kernel (introduced in 2.6.12). All this does is to make the appropriate posix_lock_file_wait()/flock_lock_file_wait() call and log a warning if the lock manager returns an error. This version just calls posix_lock_file_wait directly and prints the warning directly from do_setlk().
Comment 71 Bryn M. Reeves 2008-12-04 15:10:19 EST
Upstream patch applies OK to RHEL5
Comment 72 Eugene Teo (Security Response) 2008-12-04 22:37:47 EST
Tested with 2.6.9-78.0.1.2.bz456282.EL (see comment #67):
[root@rhel4-as-i386 ~]# ./t /mnt/testfile
Shutting down RPC svcgssd:                                 [FAILED]
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS quotas:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]

[...]
nfsd: last server has exited
nfsd: unexporting all filesystems
do_setlk: VFS is out of sync with lock manager!

Can't trigger the BUG().
Comment 73 Eugene Teo (Security Response) 2008-12-04 23:02:23 EST
Can't trigger the BUG() using the testcase in comment #7 too.
Comment 74 Eugene Teo (Security Response) 2008-12-05 02:15:41 EST
Hmm. So I added the patch, and brew built a rhel-5 kernel. Did you see this before?

[root@rhel5-server-i386 ~]# ./t /mnt/testfile
Shutting down NFS mountd:                                  [  OK  ]
Shutting down NFS daemon:                                  [  OK  ]
Shutting down NFS services:                                [  OK  ]

nfsd: last server has exited
nfsd: unexporting all filesystems
lockd: cannot monitor 127.0.0.1
lockd: failed to monitor 127.0.0.1
portmap: server localhost not responding, timed out
RPC: failed to contact portmap (errno -5).
BUG: unable to handle kernel paging request at virtual address 205c1e7f
 printing eip:
c0431a3f
*pde = 00000000
Oops: 0000 [#1]
SMP 
last sysfs file: /block/dm-1/range
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss sunrpc ipt_REJECT ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport floppy pcspkr ide_cd i2c_piix4 i2c_core 8139too serio_raw 8139cp mii cdrom dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    1
EIP:    0060:[<c0431a3f>]    Not tainted VLI
EFLAGS: 00010297   (2.6.18-125.el5.bz456288 #1) 
EIP is at flush_workqueue+0x3f/0x66
eax: 205c1e7f   ebx: 00000000   ecx: 00000286   edx: 00000000
esi: de78f340   edi: 00000000   ebp: 00000801   esp: dc705ee0
ds: 007b   es: 007b   ss: 0068
Process rpc.nfsd (pid: 1862, ti=dc705000 task=df63e000 task.ti=dc705000)
Stack: e0baf320 de78f340 c0431a87 dd9d80e0 00000000 e0b9b4fd c060fce7 dd9d80c0 
       dd9d80e0 dd9d80c0 00000007 e0b874e5 dd9d80c0 fffffdff e0b11055 e0bad768 
       e0b8749a dfc89a40 00000008 e0b8792a 00000000 e0b87aef e0ba07c0 dfc89a40 
Call Trace:
 [<c0431a87>] cancel_rearming_delayed_workqueue+0x21/0x23
 [<e0b9b4fd>] nfs4_state_shutdown+0x19/0x155 [nfsd]
 [<c060fce7>] _spin_lock_bh+0x8/0x18
 [<e0b874e5>] nfsd_last_thread+0x39/0x62 [nfsd]
 [<e0b11055>] svc_destroy+0x66/0x95 [sunrpc]
 [<e0b8749a>] nfsd_svc+0x1aa/0x1bc [nfsd]
 [<e0b8792a>] write_svc+0x0/0x18 [nfsd]
 [<e0b87aef>] nfsctl_transaction_write+0x36/0x60 [nfsd]
 [<c0497245>] sys_nfsservctl+0x13d/0x17c
 [<c0471319>] do_sys_open+0xa6/0xae
 [<c0404f17>] syscall_call+0x7/0xb
 =======================
Code: 15 94 88 7a c0 5b 5e f7 d0 8b 04 90 e9 54 ff ff ff b8 fc fc 67 c0 e8 69 d4 1d 00 b8 e0 ef 6e c0 e8 83 44 0b 00 eb 18 8b 06 f7 d0 <8b> 04 98 e8 32 ff ff ff ba e0 ef 6e c0 89 d8 e8 83 44 0b 00 83 
EIP: [<c0431a3f>] flush_workqueue+0x3f/0x66 SS:ESP 0068:dc705ee0
 <0>Kernel panic - not syncing: Fatal exception
Comment 76 Eugene Teo (Security Response) 2008-12-05 02:45:57 EST
(In reply to comment #74)
> Hmm. So I added the patch, and brew built a rhel-5 kernel. Did you see this
> before?
> 
> [root@rhel5-server-i386 ~]# ./t /mnt/testfile
> Shutting down NFS mountd:                                  [  OK  ]
> Shutting down NFS daemon:                                  [  OK  ]
> Shutting down NFS services:                                [  OK  ]
> 
> nfsd: last server has exited
> nfsd: unexporting all filesystems
> lockd: cannot monitor 127.0.0.1
> lockd: failed to monitor 127.0.0.1
> portmap: server localhost not responding, timed out
> RPC: failed to contact portmap (errno -5).
> BUG: unable to handle kernel paging request at virtual address 205c1e7f

2.6.27.5-117.fc10.x86_64 seems ok.
Comment 77 Harald Klein 2008-12-05 04:10:03 EST
tested with kernel-2.6.9-78.0.1.2.bz456282.EL.i686.rpm (patch from comment #69), was not able to crash. Plenty of those:
do_setlk: VFS is out of sync with lock manager!

br, Hari
Comment 79 Bryn M. Reeves 2008-12-11 08:05:44 EST
Created attachment 326611 [details]
revised patch
Comment 80 Harald Klein 2008-12-12 07:01:19 EST
tested with kernel-2.6.9-78.0.1.3.bz456282.EL.i686.rpm (patch from comment
#79) and the original reproducer, was not able to crash. I did not get any "out of sync" messages.

br, Hari
Comment 87 Tony Procaccini 2009-01-28 15:14:27 EST
Can someone please repost the tester code from comment #5 no perms to view it.  I am seeing this crash in 2.6.9-42.0.2 rhel4u4 randomly while running a custom application that dont know enough about what it is doing to comment on.  I have applied the revised patch from comment #79 and the issue persists.  I would like to be able to replicate with a set of test code since will not be able to determine what our application is doing when it crashes.  Does this patch depend on any previous patches being applied - as i had a hard time issuing the patch since locks.c was already being modified by a previous patch in 2.6.9-42src tree?  We also have 2.6.9-67 that i can try but looking at the src it is relatively the same code.  Please advise if any insight.  Thanks
Comment 88 Bryn M. Reeves 2009-01-29 05:02:56 EST
The code is in comment #5 and comment #13. Steps to reproduce are listed in comment #7. None of these are private and I've checked the permissions are correct on the attachments themselves.
Comment 89 Tony Procaccini 2009-01-29 15:16:52 EST
Thanks, I canNOT get the server to crash with the test code from comment #5/#13 on UNpatched kernel versions 2.6.9-42.02/-67/and -78.  I CAN get to crash using comment #5 lockit and comment #67 t.c code.  The patched verion of 2.6.9-42.0.2 seems to ride through both test cases.  However I have had an operational crash with the same bug on this patched kernel so I am still try to understand why, maybe locking files differently with application code.
Comment 90 Bryn M. Reeves 2009-01-30 05:47:34 EST
There are several other problems in the locking code in older kernels that can lead to a panic with the same BUG message and backtrace, it's quite possible that you are hitting one of those.
Comment 91 Jeff Layton 2009-01-30 11:38:41 EST
(In reply to comment #87)
> Can someone please repost the tester code from comment #5 no perms to view it. 
> I am seeing this crash in 2.6.9-42.0.2 rhel4u4 randomly while running a custom
> application that dont know enough about what it is doing to comment on.  I have
> applied the revised patch from comment #79 and the issue persists.  I would
> like to be able to replicate with a set of test code since will not be able to
> determine what our application is doing when it crashes.  Does this patch
> depend on any previous patches being applied - as i had a hard time issuing the
> patch since locks.c was already being modified by a previous patch in
> 2.6.9-42src tree?  We also have 2.6.9-67 that i can try but looking at the src
> it is relatively the same code.  Please advise if any insight.  Thanks

The crash signature in this situation is unfortunately somewhat "generic". There are a lot of potential root causes for this problem but the panic will look the same. We've fixed several of these sorts of problems in earlier kernels.

This BZ has been opened specifically to address the problem/reproducer that Bryn and Uli described in comment 67. If you're still seeing panics that look like this, my suggestion would be to try and reproduce them on as recent a kernel as possible.

Even better, if you have a non-critical place to do so would be to test the kernels on my people.redhat.com page:

http://people.redhat.com/jlayton/

They have patches that are slated for RHEL4.8. If you can still reproduce this panic with those, then please open another BZ. It's probable that the underlying cause is different than the one reported here.
Comment 94 Corey Marthaler 2009-02-04 15:28:19 EST
I appear to be hitting this issue in RHEL5, is there a clone of this bug open for RHEL5?
Comment 101 Eugene Teo (Security Response) 2009-02-18 02:44:49 EST
(In reply to comment #94)
> I appear to be hitting this issue in RHEL5, is there a clone of this bug open
> for RHEL5?

Yes, but the bugs are for tracking purposes. Please use this bug for reporting the problem you experienced. Thanks.
Comment 110 Eugene Teo (Security Response) 2009-04-13 23:07:49 EDT
*** Bug 491695 has been marked as a duplicate of this bug. ***
Comment 114 errata-xmlrpc 2009-04-29 05:28:42 EDT
This issue has been addressed in following products:

  MRG for RHEL-5

Via RHSA-2009:0451 https://rhn.redhat.com/errata/RHSA-2009-0451.html
Comment 115 Ryan Ordway 2009-04-30 03:10:05 EDT
I am running into this issue on 5.2. I am getting terrible performance on an NFS mount because of the problem. Will this be fixed prior to 5.4, or will I have to wait until 5.4 for an official kernel package?
Comment 116 Bryn M. Reeves 2009-04-30 10:48:18 EDT
Hello Ryan,

Are you sure that this is the problem you are experiencing? This problem does not result in performance degradation per-se. When it triggers it causes the NFS client to panic with a BUG message (I guess though that if it is causing your nodes to reboot frequently that's going to cause poor perceived performance overall).

Due to the security concerns and wide impact of this bug it is planned to be addressed via an async errata release for all supported RHEL products where this problem exists. A fix for this issue has been proposed and is now undergoing quality assurance and verification prior to final release.

If you have active Red Hat Enterprise Linux support entitlements, please file a case with your Red Hat Global Support Services contacts who will be able to assist you and advise on the availability of fixed packages for affected releases.

Regards,
Comment 117 J. Ryan Earl 2009-04-30 10:57:58 EDT
This issue does not trigger node reboots, the reduced application performance is from reduced NFS performance.  We have empirically measured NFS performance to be low, specifically in the area of lock acquisition.  Processes are blocking for significantly longer waiting on locks when this condition triggers, specifically the one (1) lock that the error condition is being triggered on.

We're using a Kernel we built with the patch, there's simply no other option.  We couldn't wait for a fix.  With 100% assurance I can say that this patch fixed our performance problems with multiple processes synchronizing on an NFS file lock.
Comment 118 J. Ryan Earl 2009-04-30 11:07:39 EDT
(In reply to comment #117)
> This issue does not trigger node reboots, the reduced application performance
> is from reduced NFS performance.  We have empirically measured NFS performance
> to be low, specifically in the area of lock acquisition.  Processes are
> blocking for significantly longer waiting on locks when this condition
> triggers, specifically the one (1) lock that the error condition is being
> triggered on.
> 
> We're using a Kernel we built with the patch, there's simply no other option. 
> We couldn't wait for a fix.  With 100% assurance I can say that this patch
> fixed our performance problems with multiple processes synchronizing on an NFS
> file lock.  

This comment is a bit out of context, I thought I was replying to a different bug (https://bugzilla.redhat.com/show_bug.cgi?id=491695) which happened to be marked as a duplicate of this bug.  I got confused, I only had 1 bug open and thought I was replying to that one so while it's the same issue the response is a bit out of context.

I was only referring to my setup/tests in which I produced the other bug which has the same code solution in the kernel: ie Bug #491695

In my setup, application performance is affected by a dramatic decline in locking performance on NFS.  Not just locking performance, locking correctness I should add.  strace will show a return that a process has successfully acquired a lock, yet it will still be stuck in the fcntl64() call somehow...
Comment 119 Bryn M. Reeves 2009-04-30 11:10:34 EDT
Interesting - that's completely believable. We do have reports of multiple
different NFS problems that are addressed by the patch for this problem
(although this bugzilla is specific to client-side crashes triggered by the
BUG() macro in locks_remove_flock). This is the first report we have of a
significant performance degradation related to this change (e.g. we have
reports of inconsistent lock status between the VFS and NLM leading to locks
that can never be released).
Comment 120 Ryan Ordway 2009-04-30 14:00:37 EDT
Yeah, sorry to muddy the waters. My problem is the same as J. Ryan's (Bug #491695), which was marked as a duplicate of this bug. I'm lucky in the fact that I can work around it by using "nolock", which does fix the problem for me. I don't need locking on *these* clients, because they are just compute nodes that work on different files for each node.
Comment 121 errata-xmlrpc 2009-04-30 17:25:09 EDT
This issue has been addressed in following products:

  Red Hat Enterprise Linux 4

Via RHSA-2009:0459 https://rhn.redhat.com/errata/RHSA-2009-0459.html
Comment 122 errata-xmlrpc 2009-05-07 06:53:28 EDT
This issue has been addressed in following products:

  Red Hat Enterprise Linux 5

Via RHSA-2009:0473 https://rhn.redhat.com/errata/RHSA-2009-0473.html
Comment 123 Linda Wang 2009-05-12 22:36:57 EDT
*** Bug 499089 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.