From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040113 Description of problem: There are inconsistencies in the proc filesystem's proc_dir_entry that can cause a proc_dir_entry to be freed while some other process still holds a reference to it. This is being seen as panics with the following fingerprint: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: 0229c38f *pde = 00003001 *pte = 00000000 Oops: 0000 netconsole parport_pc lp parport autofs tg3 floppy sg keybdev mousedev hid input usb-uhci usbcore ext3 jbd ips qla2300 aic7xxx sd_mod scsi_mod CPU: 4 EIP: 0060:[<0229c38f>] Not tainted EFLAGS: 00010097 EIP is at vsnprintf [kernel] 0x2df (2.4.21-15.0.2.ELhugemem/i686) eax: 00000004 ebx: 0000000a ecx: 00000004 edx: fffffffe esi: 02472f6e edi: 00000000 ebp: 0247335f esp: d68cfedc ds: 0068 es: 0068 ss: 0068 Process atop (pid: 12106, stackpage=d68cf000) Stack: 00006807 00000006 00017201 0210ade2 00000000 00000000 00000011 00000002 ffffffff ffffffff 92cd0000 00000246 1120b800 a33d0a80 02128dee 02472f60 00000400 022b8063 d68cff40 92cd0000 d68ce000 1120b800 0218f399 022b8054 Call Trace: [<0210ade2>] __down [kernel] 0x72 (0xd68cfee8) [<02128dee>] printk [kernel] 0x7e (0xd68cff14) [<0218f399>] de_put [kernel] 0xa9 (0xd68cff34) [<0218f3c0>] proc_delete_inode [kernel] 0x0 (0xd68cff44) [<0217d8ef>] iput [kernel] 0x14f (0xd68cff48) [<0217a749>] dput [kernel] 0xc9 (0xd68cff64) [<0216289a>] __fput [kernel] 0xba (0xd68cff78) [<02160bae>] filp_close [kernel] 0x8e (0xd68cff94) [<02160c56>] sys_close [kernel] 0x66 (0xd68cffb0) The following bit of code is responsible for this particular panic: static void de_put(struct proc_dir_entry *de) { if (de) { lock_kernel(); if (!atomic_read(&de->count)) { printk("de_put: entry %s already free!\n", de->name); unlock_kernel(); return; } There are a couple of problems with this particular check. If the entry has indeed already been freed, then any structure elements within it should not be referenced as they are by definition no longer valid. At best, you end up with messages like the following printed to the console: de_put: entry <NULL> already free! Worse, if the new value of what was once de->name is not null, the system panics. The potentially nasty situation comes in when the entry has been freed, but the new value of what was once de->count is not 0. Then, the next line will actually corrupt someone else's data and potentially free it: if (atomic_dec_and_test(&de->count)) { if (de->deleted) { printk("de_put: deferred delete of %s\n", de->name); free_proc_entry(de); } } Version-Release number of selected component (if applicable): 2.4.21-15.0.2.ELhugemem/i686 How reproducible: Always Steps to Reproduce: 1. Start up a large number of processes (in the environment that the panic has been seen, 12000 processes were started. 2. Run atop. 3. Stop the processes. Actual Results: Top appears to race with the deletion of the associated PID proc entries, and the system panics. Expected Results: No panic ;-) Additional info:
Dave, This issue also reported via support channel.... The difficulty here is that this piece of code is terribly designed and implemented. First, the proc_dir_entry is allocated via "malloc". That's a very bad start - should use a fixed set of slab cache objects rather than a general purpose one (I think). Now let's examine one possible race condition: thread A calls remove_proc_entry() thread A does an atomic_read(&de->count) and finds it is zero thread B calls proc_get_inode() and got the "de" pionter thread A does a "free_proc_entry" - which does a free(de). ("de" now "returns" to malloc pool - God knows who gets it and what it has done to that piece of memory) thread B still has the "de" so it'll trash someone's memory by either thread B does de_get() which incrases de->count (trashes someone's memory) or. thread B does a de_put() which'll either crash and/or trashes someone's memory. The "de" needs to get from a fixed set of slab cache and we need a lock to protect the structure. What do you think ?
First of all, we shouldn't be using any proc_dir_entry for per-process stuff. What's more, exclusion should *prevent* proc_get_inode() after we have removed the entry from lists, which happens before the final de_put(). BKL used to provide that exclusion, IIRC. Whether it still does so or not is the question. Special slab vs. kmalloc() is a red herring - either we get it right (in which case it doesn't matter) or we do not (in which case nasty stuff will happen anyway).
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.11.EL).
An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-550.html