Description of problem: Kernel crash Dec 5 16:33:55 xxxxx kernel: Unable to handle kernel paging request at virtual address 764b025b Dec 5 16:33:55 xxxxx kernel: printing eip: Dec 5 16:33:55 xxxxx kernel: ecd791bc Dec 5 16:33:55 xxxxx kernel: *pde = 00000000 Dec 5 16:33:55 xxxxx kernel: Oops: 0002 [#1] Dec 5 16:33:55 xxxxx kernel: SMP Dec 5 16:33:55 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79xx sd_mod scsi_mod Dec 5 16:33:55 xxxxx kernel: CPU: 2 Dec 5 16:33:55 xxxxx kernel: EIP: 0060:[<ecd791bc>] Not tainted VLI Dec 5 16:33:55 xxxxx kernel: EFLAGS: 00010206 (2.6.9-22.0.1.ELsmp) Dec 5 16:33:55 xxxxx kernel: EIP is at 0xecd791bc Dec 5 16:33:55 xxxxx kernel: eax: e31a1100 ebx: e31a1180 ecx: e31a11ff edx: ecd791a4 Dec 5 16:33:55 xxxxx kernel: esi: e31a11f6 edi: c82c6a86 ebp: 00000022 esp: dfcdef70 Dec 5 16:33:55 xxxxx kernel: ds: 007b es: 007b ss: 0068 Dec 5 16:33:55 xxxxx kernel: Process qmail-queue (pid: 16143, threadinfo=dfcde000 task=f7be02b0) Dec 5 16:33:55 xxxxx kernel: Stack: c015943f c82c6a80 07ffffff 00001094 c01234e7 f7be07f0 c82c6a80 f7be02b0 Dec 5 16:33:55 xxxxx kernel: 00000000 c01240fd d8c29580 00000000 dfcde000 dfcde000 c012439f 00000000 Dec 5 16:33:55 xxxxx kernel: 00000000 bfffd6cf bfffd6f0 dfcde000 c02d0fb7 00000000 0804a509 00000481 Dec 5 16:33:56 xxxxx kernel: Call Trace: Dec 5 16:33:56 xxxxx kernel: [<c015943f>] filp_close+0x3b/0x5f Dec 5 16:33:56 xxxxx kernel: [<c01234e7>] put_files_struct+0x57/0xc0 Dec 5 16:33:56 xxxxx kernel: [<c01240fd>] do_exit+0x227/0x3de Dec 5 16:33:56 xxxxx kernel: [<c012439f>] sys_exit_group+0x0/0xd Dec 5 16:33:56 xxxxx kernel: [<c02d0fb7>] syscall_call+0x7/0xb Dec 5 16:33:56 xxxxx kernel: Code: 96 d7 ec 18 65 26 c3 73 71 9e 01 04 00 00 00 f0 91 d7 ec a4 91 d7 ec a4 91 d7 ec 74 96 d7 ec 74 96 d7 ec 2c 0b fc e4 9c a7 fc d7 <30> 93 db f0 30 93 db f0 00 00 00 00 00 00 00 00 00 58 3b c3 00 Dec 5 16:33:56 xxxxx kernel: <0>Fatal exception: panic in 5 seconds Version-Release number of selected component (if applicable): 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:27 EDT 2005 i686 i686 i386 GNU/Linux How reproducible: Don't know. It crashed. Additional info: lsmod Module Size Used by md5 8001 1 ipv6 240097 58 dm_mirror 28449 0 dm_mod 58949 1 dm_mirror button 10449 0 battery 12869 0 ac 8773 0 uhci_hcd 32729 0 shpchp 84677 0 e1000 96429 0 floppy 58065 0 ext3 118729 3 jbd 59481 1 ext3 raid1 19521 3 aic79xx 187484 8 sd_mod 20545 12 scsi_mod 116429 2 aic79xx,sd_mod
hmmm, eip: ecd791bc suggests a crash in a module. Can you pls post 'cat /proc/modules' so we can see what module is loaded at that address. That will help to narrow down things hopefully. Also, can you please describe a bit more about when the problem, how often, if you've seen it on previous kernels, if so what versions? thanks.
It's the first time I've seen. This server sometimes is a bit loaded and this crash happened on a peak period. cat /proc/modules md5 8001 1 - Live 0xf8865000 ipv6 240097 76 - Live 0xf8ae0000 dm_mirror 28449 0 - Live 0xf88f9000 dm_mod 58949 1 dm_mirror, Live 0xf891b000 button 10449 0 - Live 0xf885d000 battery 12869 0 - Live 0xf88ea000 ac 8773 0 - Live 0xf8861000 uhci_hcd 32729 0 - Live 0xf8868000 shpchp 84677 0 - Live 0xf8881000 e1000 96429 0 - Live 0xf8902000 floppy 58065 0 - Live 0xf8871000 ext3 118729 3 - Live 0xf88c7000 jbd 59481 1 ext3, Live 0xf882f000 raid1 19521 3 - Live 0xf8829000 aic79xx 187484 8 - Live 0xf8898000 sd_mod 20545 12 - Live 0xf8822000 scsi_mod 116429 2 aic79xx,sd_mod, Live 0xf883f000
Second crash. Do you think it might be a hardware problem ? Dec 5 18:53:38 xxxxx kernel: Unable to handle kernel paging request at virtual address b07e7294 Dec 5 18:53:38 xxxxx kernel: printing eip: Dec 5 18:53:38 xxxxx kernel: c01208a2 Dec 5 18:53:38 xxxxx kernel: *pde = 00000000 Dec 5 18:53:38 xxxxx kernel: Oops: 0002 [#1] Dec 5 18:53:38 xxxxx kernel: SMP Dec 5 18:53:38 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79xx sd_mod scsi_mod Dec 5 18:53:38 xxxxx kernel: CPU: 0 Dec 5 18:53:38 xxxxx kernel: EIP: 0060:[<c01208a2>] Not tainted VLI Dec 5 18:53:38 xxxxx kernel: EFLAGS: 00010282 (2.6.9-22.0.1.ELsmp) Dec 5 18:53:38 xxxxx kernel: EIP is at copy_files+0x1c0/0x2b7 Dec 5 18:53:38 xxxxx kernel: eax: b07e7280 ebx: f88e51b8 ecx: 00000000 edx: 00000433 Dec 5 18:53:38 xxxxx kernel: esi: ed596094 edi: f89311b4 ebp: e8660b80 esp: f2144f10 Dec 5 18:53:38 xxxxx kernel: ds: 007b es: 007b ss: 0068 Dec 5 18:53:38 xxxxx kernel: Process qmail-inject (pid: 7345, threadinfo=f2144000 task=f3605830) Dec 5 18:53:38 xxxxx kernel: Stack: f551b7a4 f551b630 00000094 4394a8d2 000299d3 00000000 000004a0 f89311b4 Dec 5 18:53:38 xxxxx kernel: ebda3280 f551b630 00000000 00004111 00000000 f551b630 c0120d84 f2144fc4 Dec 5 18:53:38 xxxxx kernel: bff73bb0 00004111 f2144fc4 00004111 00000000 00001cb2 c01215a0 00000000 Dec 5 18:53:38 xxxxx kernel: Call Trace: Dec 5 18:53:38 xxxxx kernel: [<c0120d84>] copy_process+0x38a/0xaba Dec 5 18:53:38 xxxxx kernel: [<c01215a0>] do_fork+0x8e/0x175 Dec 5 18:53:38 xxxxx kernel: [<c0104990>] sys_vfork+0x18/0x1c Dec 5 18:53:38 xxxxx kernel: [<c02d0fb7>] syscall_call+0x7/0xb Dec 5 18:53:38 xxxxx kernel: Code: a4 8b 4c 24 20 8b 7d 1c 8b 71 1c 89 d1 f3 a5 a8 02 74 02 66 a5 a8 01 74 01 a4 8b 54 24 18 85 d2 74 1d 8b 03 83 c3 04 85 c0 74 04 <f0> ff 40 14 8b 7c 24 1c 89 07 83 c7 04 4a 89 7c 24 1c eb e1 8b Dec 5 18:53:38 xxxxx kernel: <0>Fatal exception: panic in 5 seconds
One more crash today: Dec 9 09:40:46 xxxxx kernel: Unable to handle kernel paging request at virtual address aaedfca0 Dec 9 09:40:46 xxxxx kernel: printing eip: Dec 9 09:40:46 xxxxx kernel: c0159409 Dec 9 09:40:46 xxxxx kernel: *pde = 00000000 Dec 9 09:40:46 xxxxx kernel: Oops: 0000 [#1] Dec 9 09:40:46 xxxxx kernel: SMP Dec 9 09:40:46 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79 xx sd_mod scsi_mod Dec 9 09:40:46 xxxxx kernel: CPU: 2 Dec 9 09:40:46 xxxxx kernel: EIP: 0060:[<c0159409>] Not tainted VLI Dec 9 09:40:46 xxxxx kernel: EFLAGS: 00010282 (2.6.9-22.0.1.ELsmp) Dec 9 09:40:46 xxxxx kernel: EIP is at filp_close+0x5/0x5f Dec 9 09:40:46 xxxxx kernel: eax: aaedfc80 ebx: e12c9b00 ecx: aaedfc80 edx: e12c9b00 Dec 9 09:40:46 xxxxx kernel: esi: 07ffffff edi: e12c9b00 ebp: 00000021 esp: c9162f74 Dec 9 09:40:46 xxxxx kernel: ds: 007b es: 007b ss: 0068 Dec 9 09:40:46 xxxxx kernel: Process qmail-queue (pid: 25011, threadinfo=c9162000 task=e18ee2b0) Dec 9 09:40:46 xxxxx kernel: Stack: e12c9b00 07ffffff 00001014 c01234e7 e18ee7f0 e12c9b00 e18ee2b0 00000000 Dec 9 09:40:46 xxxxx kernel: c01240fd c8c81c80 00000000 c9162000 c9162000 c012439f 00000000 00000000 Dec 9 09:40:46 xxxxx kernel: bfe2752f bfe27550 c9162000 c02d0fb7 00000000 0804a509 000005e7 bfe2752f Dec 9 09:40:46 xxxxx kernel: Call Trace: Dec 9 09:40:46 xxxxx kernel: [<c01234e7>] put_files_struct+0x57/0xc0 Dec 9 09:40:46 xxxxx kernel: [<c01240fd>] do_exit+0x227/0x3de Dec 9 09:40:46 xxxxx kernel: [<c012439f>] sys_exit_group+0x0/0xd Dec 9 09:40:46 xxxxx kernel: [<c02d0fb7>] syscall_call+0x7/0xb Dec 9 09:40:46 xxxxx kernel: Code: f0 89 de e8 15 ff ff ff eb cc 5b 89 f0 5e 5f c3 ff 74 24 08 68 41 02 00 00 ff 74 24 0c e8 71 ff ff ff 83 c4 0c c3 57 89 d7 56 53 <8b> 70 20 89 c3 85 f6 74 07 c7 40 20 00 00 00 00 8b 43 14 85 c0 Dec 9 09:40:46 hm319 kernel: <0>Fatal exception: panic in 5 seconds All the crashes occurred on a moment that a user is sending lots of e-mails. I don't know if this helps, but I have a qmail queue (with all the special files) on a filesystem mounted with "usrquota,acl,ext3" options.
It crashed today again Dec 15 17:25:03 xxxxx kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000002d Dec 15 17:25:03 xxxxx kernel: printing eip: Dec 15 17:25:03 xxxxx kernel: c0159434 Dec 15 17:25:03 xxxxx kernel: *pde = 324ca001 Dec 15 17:25:03 xxxxx kernel: Oops: 0000 [#1] Dec 15 17:25:03 xxxxx kernel: SMP Dec 15 17:25:03 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79 xx sd_mod scsi_mod Dec 15 17:25:03 xxxxx kernel: CPU: 1 Dec 15 17:25:03 xxxxx kernel: EIP: 0060:[<c0159434>] Not tainted VLI Dec 15 17:25:03 xxxxx kernel: EFLAGS: 00010202 (2.6.9-22.0.1.ELsmp) Dec 15 17:25:03 xxxxx kernel: EIP is at filp_close+0x30/0x5f Dec 15 17:25:03 xxxxx kernel: eax: 00000001 ebx: e0684c80 ecx: e0684c80 edx: f1dfe480 Dec 15 17:25:03 xxxxx kernel: esi: dead4ead edi: f1dfe480 ebp: 00000022 esp: f6798f74 Dec 15 17:25:03 xxxxx kernel: ds: 007b es: 007b ss: 0068 Dec 15 17:25:03 xxxxx kernel: Process qmail-queue (pid: 20762, threadinfo=f6798000 task=e9d95330) Dec 15 17:25:03 xxxxx kernel: Stack: f1dfe480 07ffffff 00001094 c01234e7 e9d95870 f1dfe480 e9d95330 00000000 Dec 15 17:25:03 xxxxx kernel: c01240fd f7f4d880 00000000 f6798000 f6798000 c012439f 00000000 00000000 Dec 15 17:25:03 xxxxx kernel: bfe4b09f bfe4b0c0 f6798000 c02d0fb7 00000000 0804a509 00000777 bfe4b09f Dec 15 17:25:03 xxxxx kernel: Call Trace: Dec 15 17:25:03 xxxxx kernel: [<c01234e7>] put_files_struct+0x57/0xc0 Dec 15 17:25:03 xxxxx kernel: [<c01240fd>] do_exit+0x227/0x3de Dec 15 17:25:03 xxxxx kernel: [<c012439f>] sys_exit_group+0x0/0xd Dec 15 17:25:03 xxxxx kernel: [<c02d0fb7>] syscall_call+0x7/0xb Dec 15 17:25:03 xxxxx kernel: [<c02d007b>] __lock_text_end+0x232/0x100f Dec 15 17:25:03 xxxxx kernel: Code: 8b 70 20 89 c3 85 f6 74 07 c7 40 20 00 00 00 00 8b 43 14 85 c0 75 0d 68 70 57 2e c0 e8 00 8e fc ff 58 eb 30 8b 43 10 85 c0 74 10 <8b> 50 2c 85 d2 74 09 89 d8 ff d2 85 f6 0f 44 f0 89 fa 89 d8 e8 Dec 15 17:25:03 xxxxx kernel: <0>Fatal exception: panic in 5 seconds
Based on the second to last crash, filp_close was passed an invalid 'file struct *', aaedfca0, this is not a valid file pointer. It could be hardware, but i wouldn't rule out software too quickly. All the crashes seem to be in the same area manipulating files-3 in filp_close, 1 in copy_process.
We changed the hardware yesterday (we didn´t changed memory yet, but we´ll do today). It crashed even after this. Dec 16 05:10:29 xxxxx kernel: Unable to handle kernel paging request at virtual address 823c2208 Dec 16 05:10:29 xxxxx kernel: printing eip: Dec 16 05:10:29 xxxxx kernel: c0161611 Dec 16 05:10:29 xxxxx kernel: *pde = 00000000 Dec 16 05:10:29 xxxxx kernel: Oops: 0000 [#1] Dec 16 05:10:29 xxxxx kernel: SMP Dec 16 05:10:29 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod button battery ac uhci_hcd hw_random shpchp e1000 floppy ext3 jbd raid1 aic79xx sd_mod scsi_mod Dec 16 05:10:29 xxxxx kernel: CPU: 0 Dec 16 05:10:29 xxxxx kernel: EIP: 0060:[<c0161611>] Not tainted VLI Dec 16 05:10:29 xxxxx kernel: EFLAGS: 00010246 (2.6.9-22.0.1.ELsmp) Dec 16 05:10:29 xxxxx kernel: EIP is at generic_fillattr+0xd/0x99 Dec 16 05:10:29 xxxxx kernel: eax: 823c2200 ebx: e2768f6c ecx: dba2dde8 edx: e2768f6c Dec 16 05:10:29 xxxxx kernel: esi: 00000000 edi: e2768f6c ebp: dba2dde8 esp: e2768ee4 Dec 16 05:10:29 xxxxx kernel: ds: 007b es: 007b ss: 0068 Dec 16 05:10:29 xxxxx kernel: Process find (pid: 5518, threadinfo=e2768000 task=e85550b0) Dec 16 05:10:29 xxxxx kernel: Stack: dba2a5a4 00000000 e2768f6c c01616dd f691ae00 00000000 e2768f6c 09601c58 Dec 16 05:10:29 xxxxx kernel: e2768000 c0161784 dba2a5a4 f691ae00 00000293 00000000 00000000 00000000 Dec 16 05:10:29 xxxxx kernel: 00000001 00000000 00001000 00000008 00000000 4384e486 00000000 4384e486 Dec 16 05:10:29 xxxxx kernel: Call Trace: Dec 16 05:10:29 xxxxx kernel: [<c01616dd>] vfs_getattr+0x40/0x88 Dec 16 05:10:29 xxxxx kernel: [<c0161784>] vfs_lstat+0x25/0x37 Dec 16 05:10:29 xxxxx kernel: [<c0161d65>] sys_lstat64+0xf/0x23 Dec 16 05:10:29 xxxxx kernel: [<c015945d>] filp_close+0x59/0x5f Dec 16 05:10:29 xxxxx kernel: [<c02d0fb7>] syscall_call+0x7/0xb Dec 16 05:10:29 xxxxx kernel: Code: ae e6 fc ff 83 c4 0c 85 c0 7e 0d 53 68 b9 5b 2e c0 e8 9c e6 fc ff 58 5a 5b 31 c0 c3 90 90 57 89 c1 56 53 8b 80 a4 00 00 00 89 d3 <8b> 40 08 89 42 04 8b 41 18 89 02 0f b7 41 20 66 89 42 08 8b 41 Dec 16 05:10:29 xxxxx kernel: <0>Fatal exception: panic in 5 seconds
Here's my take on each of the 5 crashes. Unfortunately, it's not a question of "what happened?", but rather "what led up to what happened?". In any case: ------------------------------------------------------------------------ First crash in filp_close() with bogus EIP of ecd791bc: Unable to handle kernel paging request at virtual address 764b025b printing eip: ecd791bc EIP: 0060:[<ecd791bc>] Not tainted VLI EFLAGS: 00010206 (2.6.9-22.0.1.ELsmp) EIP is at 0xecd791bc eax: e31a1100 ebx: e31a1180 ecx: e31a11ff edx: ecd791a4 esi: e31a11f6 edi: c82c6a86 ebp: 00000022 esp: dfcdef70 ds: 007b es: 007b ss: 0068 Process qmail-queue (pid: 16143, threadinfo=dfcde000 task=f7be02b0) [<c015943f>] filp_close+0x3b/0x5f [<c01234e7>] put_files_struct+0x57/0xc0 [<c01240fd>] do_exit+0x227/0x3de [<c012439f>] sys_exit_group+0x0/0xd [<c02d0fb7>] syscall_call+0x7/0xb The last call, and text return address, left on the stack was the indirect call made in filp_close() c015943d: Dump of assembler code for function filp_close: 0xc0159404 <filp_close+0>: push %edi 0xc0159405 <filp_close+1>: mov %edx,%edi 0xc0159407 <filp_close+3>: push %esi 0xc0159408 <filp_close+4>: push %ebx 0xc0159409 <filp_close+5>: mov 0x20(%eax),%esi 0xc015940c <filp_close+8>: mov %eax,%ebx 0xc015940e <filp_close+10>: test %esi,%esi 0xc0159410 <filp_close+12>: je 0xc0159419 <filp_close+21> 0xc0159412 <filp_close+14>: movl $0x0,0x20(%eax) 0xc0159419 <filp_close+21>: mov 0x14(%ebx),%eax 0xc015941c <filp_close+24>: test %eax,%eax 0xc015941e <filp_close+26>: jne 0xc015942d <filp_close+41> 0xc0159420 <filp_close+28>: push $0xc02e5770 0xc0159425 <filp_close+33>: call 0xc012222a <printk> 0xc015942a <filp_close+38>: pop %eax 0xc015942b <filp_close+39>: jmp 0xc015945d <filp_close+89> 0xc015942d <filp_close+41>: mov 0x10(%ebx),%eax 0xc0159430 <filp_close+44>: test %eax,%eax 0xc0159432 <filp_close+46>: je 0xc0159444 <filp_close+64> 0xc0159434 <filp_close+48>: mov 0x2c(%eax),%edx 0xc0159437 <filp_close+51>: test %edx,%edx 0xc0159439 <filp_close+53>: je 0xc0159444 <filp_close+64> 0xc015943b <filp_close+55>: mov %ebx,%eax 0xc015943d <filp_close+57>: call *%edx 0xc015943f <filp_close+59>: test %esi,%esi <== indirect call should come back here 0xc0159441 <filp_close+61>: cmove %eax,%esi 0xc0159444 <filp_close+64>: mov %edi,%edx 0xc0159446 <filp_close+66>: mov %ebx,%eax 0xc0159448 <filp_close+68>: call 0xc0171199 <dnotify_flush> 0xc015944d <filp_close+73>: mov %ebx,%eax 0xc015944f <filp_close+75>: mov %edi,%edx 0xc0159451 <filp_close+77>: call 0xc016d15e <locks_remove_posix> ... which is the filp->f_op->flush(filp) call: int filp_close(struct file *filp, fl_owner_t id) { int retval; /* Report and clear outstanding errors */ retval = filp->f_error; if (retval) filp->f_error = 0; if (!file_count(filp)) { printk(KERN_ERR "VFS: Close: file count is 0\n"); return retval; } if (filp->f_op && filp->f_op->flush) { int err = filp->f_op->flush(filp); if (!retval) retval = err; } dnotify_flush(filp, id); locks_remove_posix(filp, id); fput(filp); return retval; } So it looks like either: (1) the filp->f_op or f_op->flush field(s) sent it off into the weeds. If there were a flush handler that was in module space, it's certainly gone by the time the filp_close() was called. (2) the flush function was called, and then it called a second function, the second function over-wrote the return addres back to the flush function with the ecd791bc, and the kernel died executing whatever was there. If #2 occurred, the text return address back to the flush function would have been wiped out, so the next one if found was the filp_close() return address. ------------------------------------------------------------------------- Second crash in copy_files(): Unable to handle kernel paging request at virtual address b07e7294 printing eip: c01208a2 CPU: 0 EIP: 0060:[<c01208a2>] Not tainted VLI EFLAGS: 00010282 (2.6.9-22.0.1.ELsmp) EIP is at copy_files+0x1c0/0x2b7 eax: b07e7280 ebx: f88e51b8 ecx: 00000000 edx: 00000433 esi: ed596094 edi: f89311b4 ebp: e8660b80 esp: f2144f10 ds: 007b es: 007b ss: 0068 Process qmail-inject (pid: 7345, threadinfo=f2144000 task=f3605830) [<c0120d84>] copy_process+0x38a/0xaba [<c01215a0>] do_fork+0x8e/0x175 [<c0104990>] sys_vfork+0x18/0x1c [<c02d0fb7>] syscall_call+0x7/0xb The return address is after the copy_files() call: 0xc0120d6a <copy_process+880>: call 0xc019ef7f <copy_semundo> 0xc0120d6f <copy_process+885>: test %eax,%eax 0xc0120d71 <copy_process+887>: mov %eax,%edi 0xc0120d73 <copy_process+889>: jne 0xc01213f8 <copy_process+2558> 0xc0120d79 <copy_process+895>: mov 0x8(%esp),%eax 0xc0120d7d <copy_process+899>: mov %ebp,%edx 0xc0120d7f <copy_process+901>: call 0xc01206e2 <copy_files> 0xc0120d84 <copy_process+906>: test %eax,%eax 0xc0120d86 <copy_process+908>: mov %eax,%edi 0xc0120d88 <copy_process+910>: jne 0xc01213f1 <copy_process+2551> The EIP of c01208a2 is in copy_files(), where it's doing the lock incl: 0xc0120895 <copy_files+435>: test %edx,%edx 0xc0120897 <copy_files+437>: je 0xc01208b6 <copy_files+468> 0xc0120899 <copy_files+439>: mov (%ebx),%eax 0xc012089b <copy_files+441>: add $0x4,%ebx 0xc012089e <copy_files+444>: test %eax,%eax 0xc01208a0 <copy_files+446>: je 0xc01208a6 <copy_files+452> 0xc01208a2 <copy_files+448>: lock incl 0x14(%eax) 0xc01208a6 <copy_files+452>: mov 0x1c(%esp),%edi 0xc01208aa <copy_files+456>: mov %eax,(%edi) 0xc01208ac <copy_files+458>: add $0x4,%edi 0xc01208bf <copy_files+477>: call 0xc02cfb9e <_spin_unlock> which is the get_file() in macro copy_files(): old_fds = oldf->fd; new_fds = newf->fd; memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8); memcpy(newf->close_on_exec->fds_bits, oldf->close_on_exec->fds_bits, open_files/8); for (i = open_files; i != 0; i--) { struct file *f = *old_fds++; if (f) get_file(f); *new_fds++ = f; } spin_unlock(&oldf->file_lock); i.e., get_file() is defined like so: #define get_file(x) atomic_inc(&(x)->f_count) So, while walking through the open file list of the current process, it came across a bogus file structure pointer of b07e7280. ------------------------------------------------------------------------- Third crash in filp_close() again: Unable to handle kernel paging request at virtual address aaedfca0 printing eip: c0159409 EIP: 0060:[<c0159409>] Not tainted VLI EFLAGS: 00010282 (2.6.9-22.0.1.ELsmp) EIP is at filp_close+0x5/0x5f eax: aaedfc80 ebx: e12c9b00 ecx: aaedfc80 edx: e12c9b00 esi: 07ffffff edi: e12c9b00 ebp: 00000021 esp: c9162f74 ds: 007b es: 007b ss: 0068 Process qmail-queue (pid: 25011, threadinfo=c9162000 task=e18ee2b0) [<c01234e7>] put_files_struct+0x57/0xc0 [<c01240fd>] do_exit+0x227/0x3de [<c012439f>] sys_exit_group+0x0/0xd [<c02d0fb7>] syscall_call+0x7/0xb put_files_struct() contains the inline function close_files(), which repeatedly calls filp_close(): static inline void close_files(struct files_struct * files) { int i, j; j = 0; for (;;) { unsigned long set; i = j * __NFDBITS; if (i >= files->max_fdset || i >= files->max_fds) break; set = files->open_fds->fds_bits[j++]; while (set) { if (set & 1) { struct file * file = xchg(&files->fd[i], NULL); if (file) { filp_close(file, files); cond_resched(); } } i++; set >>= 1; } } } ...which again comes across a bogus file structure pointer aaedfc80, and passes it to filp_close(), which crashes when referencing it: (gdb) disass filp_close Dump of assembler code for function filp_close: 0xc0159404 <filp_close+0>: push %edi 0xc0159405 <filp_close+1>: mov %edx,%edi 0xc0159407 <filp_close+3>: push %esi 0xc0159408 <filp_close+4>: push %ebx 0xc0159409 <filp_close+5>: mov 0x20(%eax),%esi ------------------------------------------------------------------------- Fourth crash in filp_close(): Unable to handle kernel NULL pointer dereference at virtual address 0000002d printing eip: c0159434 EIP: 0060:[<c0159434>] Not tainted VLI EFLAGS: 00010202 (2.6.9-22.0.1.ELsmp) EIP is at filp_close+0x30/0x5f eax: 00000001 ebx: e0684c80 ecx: e0684c80 edx: f1dfe480 esi: dead4ead edi: f1dfe480 ebp: 00000022 esp: f6798f74 ds: 007b es: 007b ss: 0068 Process qmail-queue (pid: 20762, threadinfo=f6798000 task=e9d95330) [<c01234e7>] put_files_struct+0x57/0xc0 [<c01240fd>] do_exit+0x227/0x3de [<c012439f>] sys_exit_group+0x0/0xd [<c02d0fb7>] syscall_call+0x7/0xb Here again, put_files_struct() has called filp_close() with an invalid, yet readable, value of e0684c80, and got as far as the filp->f_op->flush dereference, where filp->f_op contains a value of 1: if (filp->f_op && filp->f_op->flush) { int err = filp->f_op->flush(filp); if (!retval) retval = err; } ------------------------------------------------------------------------- Fifth crash in generic_fillattr(): Unable to handle kernel paging request at virtual address 823c2208 printing eip: c0161611 EIP: 0060:[<c0161611>] Not tainted VLI EFLAGS: 00010246 (2.6.9-22.0.1.ELsmp) EIP is at generic_fillattr+0xd/0x99 eax: 823c2200 ebx: e2768f6c ecx: dba2dde8 edx: e2768f6c esi: 00000000 edi: e2768f6c ebp: dba2dde8 esp: e2768ee4 ds: 007b es: 007b ss: 0068 Process find (pid: 5518, threadinfo=e2768000 task=e85550b0) [<c01616dd>] vfs_getattr+0x40/0x88 [<c0161784>] vfs_lstat+0x25/0x37 [<c0161d65>] sys_lstat64+0xf/0x23 [<c015945d>] filp_close+0x59/0x5f [<c02d0fb7>] syscall_call+0x7/0xb where the "find" process is doing an lstat on an inode. vfs_gettattr() calls generic_fillattr() if the inode's i_op doesn't have its own getattr function: int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) { struct inode *inode = dentry->d_inode; int retval; retval = security_inode_getattr(mnt, dentry); if (retval) return retval; if (inode->i_op->getattr) return inode->i_op->getattr(mnt, dentry, stat); generic_fillattr(inode, stat); if (!stat->blksize) { struct super_block *s = inode->i_sb; unsigned blocks; blocks = (stat->size+s->s_blocksize-1) >> s->s_blocksize_bits; stat->blocks = (s->s_blocksize / 512) * blocks; stat->blksize = s->s_blocksize; } return 0; } generic_fillattr() does just that, filling in the passed-in kstat structure: void generic_fillattr(struct inode *inode, struct kstat *stat) { stat->dev = inode->i_sb->s_dev; stat->ino = inode->i_ino; stat->mode = inode->i_mode; stat->nlink = inode->i_nlink; stat->uid = inode->i_uid; stat->gid = inode->i_gid; stat->rdev = inode->i_rdev; stat->atime = inode->i_atime; stat->mtime = inode->i_mtime; stat->ctime = inode->i_ctime; stat->size = i_size_read(inode); stat->blocks = inode->i_blocks; stat->blksize = inode->i_blksize; } The inode value passed in was dba2dde8, probably bogus, but readable. But it's i_sb value 823c2200, which when dereferenced as i_sb->s_dev crashed. So in this case, the question is where the inode reference came from, based upon the dentry found by user_path_walk_link(): int vfs_lstat(char __user *name, struct kstat *stat) { struct nameidata nd; int error; error = user_path_walk_link(name, &nd); if (!error) { error = vfs_getattr(nd.mnt, nd.dentry, stat); path_release(&nd); } return error; } ------------------------------------------------------------------------- In all of 5 cases, there's not nearly enough information to determine what led up to the corrupt file-related structures, i.e., the system "context" at the time of the crashes. In order to continue debugging this, we're going to need a netdump or diskdump, whichever you are capable of setting up. With that extra data, hopefully we can gather enough further evidence to figure out what happened.
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. Please See https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.