Bug 241083 - When we run one of our Cadence tools, the kernel panics at fs/locks.c (line 1799)
Summary: When we run one of our Cadence tools, the kernel panics at fs/locks.c (line 1799)
Keywords:
Status: CLOSED DUPLICATE of bug 240403
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Jeff Layton
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-05-23 22:40 UTC by Heath Petty
Modified: 2014-07-25 03:36 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-06-01 20:32:49 UTC
Target Upstream Version:
Embargoed:



Description Heath Petty 2007-05-23 22:40:43 UTC
Description of problem:
When we run one of our Cadence tools, the kernel panics at fs/locks.c (line
1799). Here is the backtrace:

Jan 15 23:37:39 ldvlinux33  kernel BUG at fs/locks.c:1799!
Jan 15 23:37:39 ldvlinux33  invalid operand: 0000 [#1]
Jan 15 23:37:39 ldvlinux33  Modules linked in: netconsole nfs mvfs(U) vnode(U)
nfsd exportfs lockd nfs_acl md5 ipv6 parport_pc lp parport autofs4 i2c_dev
i2c_core sunrpc dm_mirror dm_multipath dm_mod joydev uhci_hcd snd_intel8x0
snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc
snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore 8139too mii floppy ext3
jbd aic7xxx sd_mod scsi_mod
Jan 15 23:37:39 ldvlinux33  CPU:    0
Jan 15 23:37:39 ldvlinux33  EIP:    0060:[<c01810c1>]    Tainted: PF     VLI
Jan 15 23:37:39 ldvlinux33  EFLAGS: 00010246   (2.6.9-34.EL)
Jan 15 23:37:39 ldvlinux33  EIP is at locks_remove_flock+0x119/0x1b4
Jan 15 23:37:39 ldvlinux33  eax: f7ccfd10   ebx: f57cebd8   ecx: 00000000   edx:
00000081
Jan 15 23:37:39 ldvlinux33  esi: f3c50480   edi: f57ceb00   ebp: f43ee780   esp:
f5ca0e24
Jan 15 23:37:39 ldvlinux33  ds: 007b   es: 007b   ss: 0068
Jan 15 23:37:39 ldvlinux33  Process ncvhdl_p (pid: 9951, threadinfo=f5ca0000
task=f49bf8f0)
Jan 15 23:37:39 ldvlinux33  Stack: 00000001 f8d1eafe f43ee780 f57cebd8 f5ca0f30
c0180e2d 00000000 00001000
Jan 15 23:37:39 ldvlinux33         00000000 00000000 00000803 00000000 000026df
00000000 006d45de 00000000
Jan 15 23:37:39 ldvlinux33         45abc2eb 00000000 459a9d3d 00000000 459a9d3d
f43ee780 00000201 00000000
Jan 15 23:37:39 ldvlinux33  Call Trace:
Jan 15 23:37:39 ldvlinux33   [<f8d1eafe>] nfs_lock+0x0/0xc7 [nfs]
Jan 15 23:37:39 ldvlinux33   [<c0180e2d>] locks_remove_posix+0x8f/0x20a
Jan 15 23:37:39 ldvlinux33   [<c016993e>] __fput+0x41/0xee
Jan 15 23:37:39 ldvlinux33   [<c016825a>] filp_close+0x59/0x5f
Jan 15 23:37:39 ldvlinux33   [<f8b0440a>] mvop_linux_close_kernel+0xb/0x12 [vnode]
Jan 15 23:37:39 ldvlinux33   [<f8b0386d>] mvop_linux_close+0x78/0x9c [vnode]
Jan 15 23:37:39 ldvlinux33   [<f8c2624d>] mvfs_closev_ctx+0x15d/0x230 [mvfs]
Jan 15 23:37:39 ldvlinux33   [<f8b00fd9>] vnode_fop_release+0x62/0x7d [vnode]
Jan 15 23:37:39 ldvlinux33   [<c0169952>] __fput+0x55/0xee
Jan 15 23:37:39 ldvlinux33   [<c016825a>] filp_close+0x59/0x5f
Jan 15 23:37:39 ldvlinux33   [<c0123324>] put_files_struct+0x56/0xbf
Jan 15 23:37:39 ldvlinux33   [<c0124317>] do_exit+0x2df/0x59c
Jan 15 23:37:39 ldvlinux33   [<c012476c>] sys_exit_group+0x0/0xd
Jan 15 23:37:39 ldvlinux33   [<c0311443>] syscall_call+0x7/0xb
Jan 15 23:37:39 ldvlinux33   [<c031007b>] rwsem_down_read_failed+0x19f/0x204
Jan 15 23:37:39 ldvlinux33  Code: 38 39 68 3c 75 2d 0f b6 50 40 f6 c2 02 74 09
89 d8 e8 52 d8 ff ff eb 1d f6 c2 20 74 0e ba 02 00 00 00 89 d8 e8 19 e9 ff ff eb
0a <0f> 0b 07 07 94 38 32 c0 89 c3 8b 03 eb c4 b8 00 f0 ff ff 21 e0
Jan 15 23:37:39 ldvlinux33   <0>Fatal exception: panic in 5 seconds
Jan 15 23:37:44 ldvlinux33  Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):
kernel-2.6.9-34.EL

Other kernels tried:

	1. Kernel 2.6.9-34	(kernel panic)
	2. Kernel 2.6.9-11	(works fine)
	3. Kernel 2.6.9-22	(works fine)
	4. Kernel 2.6.19.2	(kernel panic)

	What we found:

	It seems this bug was introduced in kernel 2.6.9-34 while fixing
	bug 160844 (dangling POSIX locks after close).

	On further debugging, I identified the root cause:


There are two types of locks:
1. FL_POSIX locks are created with calls to fcntl().
2. FL_FLOCK locks are created with calls to flock() (which older C
   libraries emulated via fcntl()).

In our case the problem arises when we use FL_POSIX locks.
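
For reference, here is a minimal userspace sketch of the two lock types
(my illustration, not from the report; the file path is arbitrary):

/* lock-demo.c: take a POSIX (fcntl) lock and a BSD (flock) lock */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/lock-demo", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* FL_POSIX: byte-range lock via fcntl(); l_len = 0 locks the whole file */
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
			    .l_start = 0, .l_len = 0 };
	if (fcntl(fd, F_SETLK, &fl) == -1)
		perror("fcntl(F_SETLK)");

	/* FL_FLOCK: whole-file lock via flock() */
	if (flock(fd, LOCK_EX | LOCK_NB) == -1)
		perror("flock");

	/* both kinds of lock are torn down on the final close of the file */
	close(fd);
	return 0;
}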

In the call trace, the locks_remove_posix() function (fs/locks.c) is used
to remove FL_POSIX locks. Similarly, there is another function,
locks_remove_flock(), which is used to remove FL_FLOCK locks.

The call trace shows a call to the __fput() function (fs/file_table.c),
which is called from task context when the last use of a struct file *
is released (for example by aio completion).

This in turn calls locks_remove_flock(file).

The bug is that __fput() simply calls locks_remove_flock(file) without
checking whether the remaining locks are FL_POSIX or FL_FLOCK. Here is
the code snippet that introduced the bug:


		----------------
fs/file_table.c

	eventpoll_release(file);
	locks_remove_flock(file);

	if (file->f_op && file->f_op->release)
		file->f_op->release(inode, file);
		----------------
 
Actually, the above code should have been written like this (my suggestion):

fs/file_table.c

	eventpoll_release(file);
	if (file->f_op->flock) {	/* FL_FLOCK -- just a suggestion, I haven't tried it */
		locks_remove_flock(file);
	} else if (file->f_op->lock) {	/* FL_POSIX */
		locks_remove_posix(file);
	}

			OR

fs/locks.c

	Restore the old code:

	if (IS_FLOCK(fl) || IS_POSIX(fl)) {	/* tried successfully */
		locks_delete_lock(before);
		continue;
	}
	if (IS_LEASE(fl)) {
		lease_modify(before, F_UNLCK);
		continue;
	}

	instead of:

	if (IS_FLOCK(fl)) {			/* introduced in 2.6.9-34 */
		locks_delete_lock(before);
		continue;
	}
	if (IS_LEASE(fl)) {
		lease_modify(before, F_UNLCK);
		continue;
	}

Earlier, in 2.6.9-11, locks_remove_flock() (fs/locks.c) checked for both
FL_FLOCK and FL_POSIX, as below, so we didn't get any crash:

	if (IS_FLOCK(fl) || IS_POSIX(fl)) {
		locks_delete_lock(before);
		continue;
	}
	if (IS_LEASE(fl)) {
		lease_modify(before, F_UNLCK);
		continue;
	}
	/* What? */
	BUG();

In 2.6.9-34 and 2.6.9-42 (U3 and U4), to fix the 'dangling POSIX locks
after close' bug, the code was altered as follows:

	if (IS_FLOCK(fl)) {
		locks_delete_lock(before);
		continue;
	}
	if (IS_LEASE(fl)) {
		lease_modify(before, F_UNLCK);
		continue;
	}
	/* What? */
	BUG();

As the leftover lock is FL_POSIX, it skips both conditions and reaches
the BUG(); hence the kernel panic.
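
To make the fall-through concrete, here is a small userspace mock of that
loop body (my illustration with simplified flag definitions, not actual
kernel code):

#include <stdio.h>

/* simplified stand-ins for the kernel's lock-type flags */
#define FL_POSIX 1
#define FL_FLOCK 2
#define FL_LEASE 4

struct file_lock { int fl_flags; };

#define IS_POSIX(fl) ((fl)->fl_flags & FL_POSIX)
#define IS_FLOCK(fl) ((fl)->fl_flags & FL_FLOCK)
#define IS_LEASE(fl) ((fl)->fl_flags & FL_LEASE)

int main(void)
{
	/* a stray POSIX lock still on the list when locks_remove_flock() runs */
	struct file_lock stray = { .fl_flags = FL_POSIX };
	struct file_lock *fl = &stray;

	if (IS_FLOCK(fl)) {
		puts("flock lock: deleted");	/* 2.6.9-34 only handles this */
	} else if (IS_LEASE(fl)) {
		puts("lease: unlocked");
	} else {
		puts("neither branch matched: BUG() -> kernel panic");
	}
	return 0;
}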


So the fix could be either of the two options above: add the
f_op->flock / f_op->lock check to fs/file_table.c (untried), or
re-introduce the IS_POSIX() check in fs/locks.c (tried successfully).


Of course we need to check whether it breaks anything else. 
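
For what it's worth, a minimal hand-run sanity check against the original
bug 160844 regression (dangling POSIX locks after close) could look like
the sketch below; the file path is arbitrary and this is only my
assumption of how one would verify it via /proc/locks:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/lock-demo", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* take a whole-file POSIX write lock */
	struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
	if (fcntl(fd, F_SETLK, &fl) == -1)
		perror("fcntl(F_SETLK)");

	close(fd);	/* the POSIX lock must be dropped here */

	/* while this waits, /proc/locks should show no POSIX lock
	 * held by this pid on /tmp/lock-demo */
	printf("pid %d: check /proc/locks, then press Enter\n", getpid());
	getchar();
	return 0;
}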


Conclusion:

	In fact I tried the second option: I edited fs/locks.c in
	2.6.9-42 to restore the IS_FLOCK(fl) || IS_POSIX(fl) check shown
	above. It works fine, with no more kernel panics.

How reproducible:
Only reproducible in-house

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Heath Petty 2007-05-23 22:42:01 UTC
Here is the proposed patch for this issue. It has been widely tested by Cadence
in their environment and does solve the problem:

--- ./linux-2.6.9-42/fs/locks.c	2007-05-15 19:35:17.000000000 +0530
+++ ./linux-2.6.9-42CDN1/fs/locks.c	2007-05-15 19:21:02.000000000 +0530
@@ -1786,7 +1786,7 @@
 
 	while ((fl = *before) != NULL) {
 		if (fl->fl_file == filp) {
-			if (IS_FLOCK(fl)) {
+			if (IS_FLOCK(fl) || IS_POSIX(fl)) {
 				locks_delete_lock(before);
 				continue;
 			}


Comment 3 Jeff Layton 2007-06-01 20:07:13 UTC
Ok, in looking over this briefly, I have some questions that need to be answered
before we can consider this patch.

Given the stack trace above, locks_remove_posix should have been called on
this filp already when this machine crashed. Why are there any POSIX locks left
at all? That BUG() call would seem to be appropriate to me. There should be no
reason for a function intended to remove flock locks to deal with POSIX locks.
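
(For context, the 2.6.9-era close path runs roughly as below, so the
closing owner's POSIX locks should already be gone by the time __fput()
reaches locks_remove_flock(); this is paraphrased from memory of that
kernel, not quoted:)

	/* filp_close(), fs/open.c (abridged) */
	dnotify_flush(filp, id);
	locks_remove_posix(filp, id);	/* drops this owner's POSIX locks */
	fput(filp);			/* last ref -> __fput() -> locks_remove_flock() */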

I see 2 possibilities:

1) locks were iterated over in the loop in locks_remove_posix, but were not
released for some reason (maybe the posix_same_owner(fl, &lock) check failed?)

2) locks were slipped into the list during or after the locks_remove_posix call,
but before locks_remove_flock was run.

What might be appropriate is some instrumentation that tries to determine why
these locks were left, though it may be possible to track down the lock in a
core and see whether the lock owner was correct.
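
(Purely as an illustration of the kind of instrumentation meant here, one
could replace the BUG() in locks_remove_flock() with something like the
following before re-triggering; I'm assuming 2.6.9-era struct file_lock
field names, so treat this as a sketch only:)

	/* sketch: dump what we know about the stray lock instead of
	 * panicking with no detail */
	printk(KERN_ERR "locks_remove_flock: stray lock %p on %s: "
	       "flags=0x%x type=%d owner=%p pid=%u\n",
	       fl, filp->f_dentry->d_name.name,
	       fl->fl_flags, fl->fl_type, fl->fl_owner, fl->fl_pid);
	BUG();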

Unfortunately, I'm guessing that these are mvfs locks, and I presume they have
their own lock ops. We'll likely need IBM to take the lead on this and tell
us why there are still POSIX locks on the list.


Comment 4 Jeff Layton 2007-06-01 20:28:28 UTC
It's also possible that this is a dupe of bz 211092, which recently had a patch
posted. Have they tested a kernel with that patch?


Comment 5 Jeff Layton 2007-06-01 20:32:49 UTC
Actually this looks identical to bz 240403...

*** This bug has been marked as a duplicate of 240403 ***

