175016 – Kernel crash unexpectedly - latest version 2.6.9-22.0.1.ELsmp

Bug 175016 - Kernel crash unexpectedly - latest version 2.6.9-22.0.1.ELsmp

Summary: Kernel crash unexpectedly - latest version 2.6.9-22.0.1.ELsmp

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Dave Anderson
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2005-12-05 18:59 UTC by Renato
Modified:	2012-06-20 16:09 UTC (History)
CC List:	1 user (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-06-20 16:09:33 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Renato 2005-12-05 18:59:58 UTC

Description of problem:

Kernel crash

Dec  5 16:33:55 xxxxx kernel: Unable to handle kernel paging request at virtual 
address 764b025b
Dec  5 16:33:55 xxxxx kernel:  printing eip:
Dec  5 16:33:55 xxxxx kernel: ecd791bc
Dec  5 16:33:55 xxxxx kernel: *pde = 00000000
Dec  5 16:33:55 xxxxx kernel: Oops: 0002 [#1]
Dec  5 16:33:55 xxxxx kernel: SMP
Dec  5 16:33:55 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod 
button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79xx sd_mod 
scsi_mod
Dec  5 16:33:55 xxxxx kernel: CPU:    2
Dec  5 16:33:55 xxxxx kernel: EIP:    0060:[<ecd791bc>]    Not tainted VLI
Dec  5 16:33:55 xxxxx kernel: EFLAGS: 00010206   (2.6.9-22.0.1.ELsmp)
Dec  5 16:33:55 xxxxx kernel: EIP is at 0xecd791bc
Dec  5 16:33:55 xxxxx kernel: eax: e31a1100   ebx: e31a1180   ecx: e31a11ff   
edx: ecd791a4
Dec  5 16:33:55 xxxxx kernel: esi: e31a11f6   edi: c82c6a86   ebp: 00000022   
esp: dfcdef70
Dec  5 16:33:55 xxxxx kernel: ds: 007b   es: 007b   ss: 0068
Dec  5 16:33:55 xxxxx kernel: Process qmail-queue (pid: 16143, 
threadinfo=dfcde000 task=f7be02b0)
Dec  5 16:33:55 xxxxx kernel: Stack: c015943f c82c6a80 07ffffff 00001094 
c01234e7 f7be07f0 c82c6a80 f7be02b0
Dec  5 16:33:55 xxxxx kernel:        00000000 c01240fd d8c29580 00000000 
dfcde000 dfcde000 c012439f 00000000
Dec  5 16:33:55 xxxxx kernel:        00000000 bfffd6cf bfffd6f0 dfcde000 
c02d0fb7 00000000 0804a509 00000481
Dec  5 16:33:56 xxxxx kernel: Call Trace:
Dec  5 16:33:56 xxxxx kernel:  [<c015943f>] filp_close+0x3b/0x5f
Dec  5 16:33:56 xxxxx kernel:  [<c01234e7>] put_files_struct+0x57/0xc0
Dec  5 16:33:56 xxxxx kernel:  [<c01240fd>] do_exit+0x227/0x3de
Dec  5 16:33:56 xxxxx kernel:  [<c012439f>] sys_exit_group+0x0/0xd
Dec  5 16:33:56 xxxxx kernel:  [<c02d0fb7>] syscall_call+0x7/0xb
Dec  5 16:33:56 xxxxx kernel: Code: 96 d7 ec 18 65 26 c3 73 71 9e 01 04 00 00 
00 f0 91 d7 ec a4 91 d7 ec a4 91 d7 ec 74 96 d7 ec 74 96 d7 ec 2c 0b fc e4 9c 
a7 fc d7 <30> 93 db f0 30 93 db f0 00 00 00 00 00 00 00 00 00 58 3b c3 00
Dec  5 16:33:56 xxxxx kernel:  <0>Fatal exception: panic in 5 seconds

Version-Release number of selected component (if applicable):

2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:27 EDT 2005 i686 i686 i386 GNU/Linux

How reproducible:

Don't know. It crashed. 

Additional info:

lsmod
Module                  Size  Used by
md5                     8001  1 
ipv6                  240097  58 
dm_mirror              28449  0 
dm_mod                 58949  1 dm_mirror
button                 10449  0 
battery                12869  0 
ac                      8773  0 
uhci_hcd               32729  0 
shpchp                 84677  0 
e1000                  96429  0 
floppy                 58065  0 
ext3                  118729  3 
jbd                    59481  1 ext3
raid1                  19521  3 
aic79xx               187484  8 
sd_mod                 20545  12 
scsi_mod              116429  2 aic79xx,sd_mod

Comment 1 Jason Baron 2005-12-05 19:59:19 UTC

hmmm, eip: ecd791bc suggests a crash in a module. Can you pls post 'cat
/proc/modules' so we can see what module is loaded at that address. That will
help to narrow down things hopefully. Also, can you please describe a bit more
about when the problem, how often, if you've seen it on previous kernels, if so
what versions? thanks.

Comment 2 Renato 2005-12-05 20:03:58 UTC

It's the first time I've seen. This server sometimes is a bit loaded and this 
crash happened on a peak period.

cat /proc/modules 

md5 8001 1 - Live 0xf8865000
ipv6 240097 76 - Live 0xf8ae0000
dm_mirror 28449 0 - Live 0xf88f9000
dm_mod 58949 1 dm_mirror, Live 0xf891b000
button 10449 0 - Live 0xf885d000
battery 12869 0 - Live 0xf88ea000
ac 8773 0 - Live 0xf8861000
uhci_hcd 32729 0 - Live 0xf8868000
shpchp 84677 0 - Live 0xf8881000
e1000 96429 0 - Live 0xf8902000
floppy 58065 0 - Live 0xf8871000
ext3 118729 3 - Live 0xf88c7000
jbd 59481 1 ext3, Live 0xf882f000
raid1 19521 3 - Live 0xf8829000
aic79xx 187484 8 - Live 0xf8898000
sd_mod 20545 12 - Live 0xf8822000
scsi_mod 116429 2 aic79xx,sd_mod, Live 0xf883f000

Comment 3 Renato 2005-12-05 21:02:32 UTC

Second crash. Do you think it might be a hardware problem ?

Dec  5 18:53:38 xxxxx kernel: Unable to handle kernel paging request at virtual 
address b07e7294
Dec  5 18:53:38 xxxxx kernel:  printing eip:
Dec  5 18:53:38 xxxxx kernel: c01208a2
Dec  5 18:53:38 xxxxx kernel: *pde = 00000000
Dec  5 18:53:38 xxxxx kernel: Oops: 0002 [#1]
Dec  5 18:53:38 xxxxx kernel: SMP
Dec  5 18:53:38 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod 
button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79xx sd_mod 
scsi_mod
Dec  5 18:53:38 xxxxx kernel: CPU:    0
Dec  5 18:53:38 xxxxx kernel: EIP:    0060:[<c01208a2>]    Not tainted VLI
Dec  5 18:53:38 xxxxx kernel: EFLAGS: 00010282   (2.6.9-22.0.1.ELsmp)
Dec  5 18:53:38 xxxxx kernel: EIP is at copy_files+0x1c0/0x2b7
Dec  5 18:53:38 xxxxx kernel: eax: b07e7280   ebx: f88e51b8   ecx: 00000000   
edx: 00000433
Dec  5 18:53:38 xxxxx kernel: esi: ed596094   edi: f89311b4   ebp: e8660b80   
esp: f2144f10
Dec  5 18:53:38 xxxxx kernel: ds: 007b   es: 007b   ss: 0068
Dec  5 18:53:38 xxxxx kernel: Process qmail-inject (pid: 7345, 
threadinfo=f2144000 task=f3605830)
Dec  5 18:53:38 xxxxx kernel: Stack: f551b7a4 f551b630 00000094 4394a8d2 
000299d3 00000000 000004a0 f89311b4
Dec  5 18:53:38 xxxxx kernel:        ebda3280 f551b630 00000000 00004111 
00000000 f551b630 c0120d84 f2144fc4
Dec  5 18:53:38 xxxxx kernel:        bff73bb0 00004111 f2144fc4 00004111 
00000000 00001cb2 c01215a0 00000000
Dec  5 18:53:38 xxxxx kernel: Call Trace:
Dec  5 18:53:38 xxxxx kernel:  [<c0120d84>] copy_process+0x38a/0xaba
Dec  5 18:53:38 xxxxx kernel:  [<c01215a0>] do_fork+0x8e/0x175
Dec  5 18:53:38 xxxxx kernel:  [<c0104990>] sys_vfork+0x18/0x1c
Dec  5 18:53:38 xxxxx kernel:  [<c02d0fb7>] syscall_call+0x7/0xb
Dec  5 18:53:38 xxxxx kernel: Code: a4 8b 4c 24 20 8b 7d 1c 8b 71 1c 89 d1 f3 
a5 a8 02 74 02 66 a5 a8 01 74 01 a4 8b 54 24 18 85 d2 74 1d 8b 03 83 c3 04 85 
c0 74 04 <f0> ff 40 14 8b 7c 24 1c 89 07 83 c7 04 4a 89 7c 24 1c eb e1 8b
Dec  5 18:53:38 xxxxx kernel:  <0>Fatal exception: panic in 5 seconds

Comment 4 Renato 2005-12-09 11:56:19 UTC

One more crash today:

Dec  9 09:40:46 xxxxx kernel: Unable to handle kernel paging request at virtual 
address aaedfca0
Dec  9 09:40:46 xxxxx kernel:  printing eip:
Dec  9 09:40:46 xxxxx kernel: c0159409
Dec  9 09:40:46 xxxxx kernel: *pde = 00000000
Dec  9 09:40:46 xxxxx kernel: Oops: 0000 [#1]
Dec  9 09:40:46 xxxxx kernel: SMP
Dec  9 09:40:46 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod 
button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79
xx sd_mod scsi_mod
Dec  9 09:40:46 xxxxx kernel: CPU:    2
Dec  9 09:40:46 xxxxx kernel: EIP:    0060:[<c0159409>]    Not tainted VLI
Dec  9 09:40:46 xxxxx kernel: EFLAGS: 00010282   (2.6.9-22.0.1.ELsmp) 
Dec  9 09:40:46 xxxxx kernel: EIP is at filp_close+0x5/0x5f   
Dec  9 09:40:46 xxxxx kernel: eax: aaedfc80   ebx: e12c9b00   ecx: aaedfc80   
edx: e12c9b00
Dec  9 09:40:46 xxxxx kernel: esi: 07ffffff   edi: e12c9b00   ebp: 00000021   
esp: c9162f74
Dec  9 09:40:46 xxxxx kernel: ds: 007b   es: 007b   ss: 0068
Dec  9 09:40:46 xxxxx kernel: Process qmail-queue (pid: 25011, 
threadinfo=c9162000 task=e18ee2b0)
Dec  9 09:40:46 xxxxx kernel: Stack: e12c9b00 07ffffff 00001014 c01234e7 
e18ee7f0 e12c9b00 e18ee2b0 00000000
Dec  9 09:40:46 xxxxx kernel:        c01240fd c8c81c80 00000000 c9162000 
c9162000 c012439f 00000000 00000000
Dec  9 09:40:46 xxxxx kernel:        bfe2752f bfe27550 c9162000 c02d0fb7 
00000000 0804a509 000005e7 bfe2752f
Dec  9 09:40:46 xxxxx kernel: Call Trace:
Dec  9 09:40:46 xxxxx kernel:  [<c01234e7>] put_files_struct+0x57/0xc0
Dec  9 09:40:46 xxxxx kernel:  [<c01240fd>] do_exit+0x227/0x3de
Dec  9 09:40:46 xxxxx kernel:  [<c012439f>] sys_exit_group+0x0/0xd
Dec  9 09:40:46 xxxxx kernel:  [<c02d0fb7>] syscall_call+0x7/0xb
Dec  9 09:40:46 xxxxx kernel: Code: f0 89 de e8 15 ff ff ff eb cc 5b 89 f0 5e 
5f c3 ff 74 24 08 68 41 02 00 00 ff 74 24 0c e8 71 ff ff ff 83 c4 0c c3 57 89 
d7 56 53 <8b> 70 20 89 c3 85 f6 74 07 c7 40 20 00 00 00 00 8b 43 14 85 c0
Dec  9 09:40:46 hm319 kernel:  <0>Fatal exception: panic in 5 seconds

All the crashes occurred on a moment that a user is sending lots of e-mails. I 
don't know if this helps, but I have a qmail queue (with all the special files) 
on a filesystem mounted with "usrquota,acl,ext3" options.

Comment 5 Renato 2005-12-15 19:37:55 UTC

It crashed today again

Dec 15 17:25:03 xxxxx kernel: Unable to handle kernel NULL pointer dereference 
at virtual address 0000002d
Dec 15 17:25:03 xxxxx kernel:  printing eip:
Dec 15 17:25:03 xxxxx kernel: c0159434
Dec 15 17:25:03 xxxxx kernel: *pde = 324ca001
Dec 15 17:25:03 xxxxx kernel: Oops: 0000 [#1]
Dec 15 17:25:03 xxxxx kernel: SMP 
Dec 15 17:25:03 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod 
button battery ac uhci_hcd shpchp e1000 floppy ext3 jbd raid1 aic79
xx sd_mod scsi_mod
Dec 15 17:25:03 xxxxx kernel: CPU:    1
Dec 15 17:25:03 xxxxx kernel: EIP:    0060:[<c0159434>]    Not tainted VLI
Dec 15 17:25:03 xxxxx kernel: EFLAGS: 00010202   (2.6.9-22.0.1.ELsmp)
Dec 15 17:25:03 xxxxx kernel: EIP is at filp_close+0x30/0x5f 
Dec 15 17:25:03 xxxxx kernel: eax: 00000001   ebx: e0684c80   ecx: e0684c80   
edx: f1dfe480
Dec 15 17:25:03 xxxxx kernel: esi: dead4ead   edi: f1dfe480   ebp: 00000022   
esp: f6798f74
Dec 15 17:25:03 xxxxx kernel: ds: 007b   es: 007b   ss: 0068
Dec 15 17:25:03 xxxxx kernel: Process qmail-queue (pid: 20762, 
threadinfo=f6798000 task=e9d95330)
Dec 15 17:25:03 xxxxx kernel: Stack: f1dfe480 07ffffff 00001094 c01234e7 
e9d95870 f1dfe480 e9d95330 00000000
Dec 15 17:25:03 xxxxx kernel:        c01240fd f7f4d880 00000000 f6798000 
f6798000 c012439f 00000000 00000000
Dec 15 17:25:03 xxxxx kernel:        bfe4b09f bfe4b0c0 f6798000 c02d0fb7 
00000000 0804a509 00000777 bfe4b09f
Dec 15 17:25:03 xxxxx kernel: Call Trace:
Dec 15 17:25:03 xxxxx kernel:  [<c01234e7>] put_files_struct+0x57/0xc0
Dec 15 17:25:03 xxxxx kernel:  [<c01240fd>] do_exit+0x227/0x3de
Dec 15 17:25:03 xxxxx kernel:  [<c012439f>] sys_exit_group+0x0/0xd
Dec 15 17:25:03 xxxxx kernel:  [<c02d0fb7>] syscall_call+0x7/0xb
Dec 15 17:25:03 xxxxx kernel:  [<c02d007b>] __lock_text_end+0x232/0x100f
Dec 15 17:25:03 xxxxx kernel: Code: 8b 70 20 89 c3 85 f6 74 07 c7 40 20 00 00 
00 00 8b 43 14 85 c0 75 0d 68 70 57 2e c0 e8 00 8e fc ff 58 eb 30 8b 43 10 85 
c0 74 10 <8b> 50 2c 85 d2 74 09 89 d8 ff d2 85 f6 0f 44 f0 89 fa 89 d8 e8
Dec 15 17:25:03 xxxxx kernel:  <0>Fatal exception: panic in 5 seconds

Comment 6 Jason Baron 2005-12-16 14:16:48 UTC

Based on the second to last crash, filp_close was passed an invalid 'file struct
*', aaedfca0, this is not a valid file pointer. It could be hardware, but i
wouldn't rule out software too quickly. All the crashes seem to be in the same
area manipulating files-3 in filp_close, 1 in copy_process.

Comment 7 Renato 2005-12-16 14:36:22 UTC

We changed the hardware yesterday (we didnÂ´t changed memory yet, but weÂ´ll do 
today). It crashed even after this.

Dec 16 05:10:29 xxxxx kernel: Unable to handle kernel paging request at virtual 
address 823c2208
Dec 16 05:10:29 xxxxx kernel:  printing eip:
Dec 16 05:10:29 xxxxx kernel: c0161611
Dec 16 05:10:29 xxxxx kernel: *pde = 00000000
Dec 16 05:10:29 xxxxx kernel: Oops: 0000 [#1]
Dec 16 05:10:29 xxxxx kernel: SMP
Dec 16 05:10:29 xxxxx kernel: Modules linked in: md5 ipv6 dm_mirror dm_mod 
button battery ac uhci_hcd hw_random shpchp e1000 floppy ext3 jbd raid1 aic79xx 
sd_mod scsi_mod
Dec 16 05:10:29 xxxxx kernel: CPU:    0
Dec 16 05:10:29 xxxxx kernel: EIP:    0060:[<c0161611>]    Not tainted VLI
Dec 16 05:10:29 xxxxx kernel: EFLAGS: 00010246   (2.6.9-22.0.1.ELsmp)
Dec 16 05:10:29 xxxxx kernel: EIP is at generic_fillattr+0xd/0x99
Dec 16 05:10:29 xxxxx kernel: eax: 823c2200   ebx: e2768f6c   ecx: dba2dde8   
edx: e2768f6c
Dec 16 05:10:29 xxxxx kernel: esi: 00000000   edi: e2768f6c   ebp: dba2dde8   
esp: e2768ee4
Dec 16 05:10:29 xxxxx kernel: ds: 007b   es: 007b   ss: 0068
Dec 16 05:10:29 xxxxx kernel: Process find (pid: 5518, threadinfo=e2768000 
task=e85550b0)
Dec 16 05:10:29 xxxxx kernel: Stack: dba2a5a4 00000000 e2768f6c c01616dd 
f691ae00 00000000 e2768f6c 09601c58
Dec 16 05:10:29 xxxxx kernel:        e2768000 c0161784 dba2a5a4 f691ae00 
00000293 00000000 00000000 00000000
Dec 16 05:10:29 xxxxx kernel:        00000001 00000000 00001000 00000008 
00000000 4384e486 00000000 4384e486
Dec 16 05:10:29 xxxxx kernel: Call Trace:
Dec 16 05:10:29 xxxxx kernel:  [<c01616dd>] vfs_getattr+0x40/0x88
Dec 16 05:10:29 xxxxx kernel:  [<c0161784>] vfs_lstat+0x25/0x37
Dec 16 05:10:29 xxxxx kernel:  [<c0161d65>] sys_lstat64+0xf/0x23
Dec 16 05:10:29 xxxxx kernel:  [<c015945d>] filp_close+0x59/0x5f
Dec 16 05:10:29 xxxxx kernel:  [<c02d0fb7>] syscall_call+0x7/0xb
Dec 16 05:10:29 xxxxx kernel: Code: ae e6 fc ff 83 c4 0c 85 c0 7e 0d 53 68 b9 
5b 2e c0 e8 9c e6 fc ff 58 5a 5b 31 c0 c3 90 90 57 89 c1 56 53 8b 80 a4 00 00 
00 89 d3 <8b> 40 08 89 42 04 8b 41 18 89 02 0f b7 41 20 66 89 42 08 8b 41
Dec 16 05:10:29 xxxxx kernel:  <0>Fatal exception: panic in 5 seconds

Comment 8 Dave Anderson 2005-12-16 16:38:15 UTC

Here's my take on each of the 5 crashes.  Unfortunately, it's not
a question of "what happened?", but rather "what led up to what happened?".
In any case:

------------------------------------------------------------------------

First crash in filp_close() with bogus EIP of ecd791bc:

  Unable to handle kernel paging request at virtual address 764b025b
  
  printing eip: ecd791bc
  
  EIP:    0060:[<ecd791bc>]    Not tainted VLI
  EFLAGS: 00010206   (2.6.9-22.0.1.ELsmp)
  EIP is at 0xecd791bc
  eax: e31a1100   ebx: e31a1180   ecx: e31a11ff   edx: ecd791a4
  esi: e31a11f6   edi: c82c6a86   ebp: 00000022   esp: dfcdef70
  ds: 007b   es: 007b   ss: 0068

Process qmail-queue (pid: 16143, threadinfo=dfcde000 task=f7be02b0)

  [<c015943f>] filp_close+0x3b/0x5f
  [<c01234e7>] put_files_struct+0x57/0xc0
  [<c01240fd>] do_exit+0x227/0x3de
  [<c012439f>] sys_exit_group+0x0/0xd
  [<c02d0fb7>] syscall_call+0x7/0xb

The last call, and text return address, left on the stack was
the indirect call made in filp_close() c015943d:

  Dump of assembler code for function filp_close:
  0xc0159404 <filp_close+0>:      push   %edi
  0xc0159405 <filp_close+1>:      mov    %edx,%edi
  0xc0159407 <filp_close+3>:      push   %esi
  0xc0159408 <filp_close+4>:      push   %ebx
  0xc0159409 <filp_close+5>:      mov    0x20(%eax),%esi
  0xc015940c <filp_close+8>:      mov    %eax,%ebx
  0xc015940e <filp_close+10>:     test   %esi,%esi
  0xc0159410 <filp_close+12>:     je     0xc0159419 <filp_close+21>
  0xc0159412 <filp_close+14>:     movl   $0x0,0x20(%eax)
  0xc0159419 <filp_close+21>:     mov    0x14(%ebx),%eax
  0xc015941c <filp_close+24>:     test   %eax,%eax
  0xc015941e <filp_close+26>:     jne    0xc015942d <filp_close+41>
  0xc0159420 <filp_close+28>:     push   $0xc02e5770
  0xc0159425 <filp_close+33>:     call   0xc012222a <printk>
  0xc015942a <filp_close+38>:     pop    %eax
  0xc015942b <filp_close+39>:     jmp    0xc015945d <filp_close+89>
  0xc015942d <filp_close+41>:     mov    0x10(%ebx),%eax
  0xc0159430 <filp_close+44>:     test   %eax,%eax
  0xc0159432 <filp_close+46>:     je     0xc0159444 <filp_close+64>
  0xc0159434 <filp_close+48>:     mov    0x2c(%eax),%edx
  0xc0159437 <filp_close+51>:     test   %edx,%edx
  0xc0159439 <filp_close+53>:     je     0xc0159444 <filp_close+64>
  0xc015943b <filp_close+55>:     mov    %ebx,%eax
  0xc015943d <filp_close+57>:     call   *%edx
  0xc015943f <filp_close+59>:     test   %esi,%esi   <== indirect call should
come back here
  0xc0159441 <filp_close+61>:     cmove  %eax,%esi
  0xc0159444 <filp_close+64>:     mov    %edi,%edx
  0xc0159446 <filp_close+66>:     mov    %ebx,%eax
  0xc0159448 <filp_close+68>:     call   0xc0171199 <dnotify_flush>
  0xc015944d <filp_close+73>:     mov    %ebx,%eax
  0xc015944f <filp_close+75>:     mov    %edi,%edx
  0xc0159451 <filp_close+77>:     call   0xc016d15e <locks_remove_posix>
  ...

which is the filp->f_op->flush(filp) call:

  int filp_close(struct file *filp, fl_owner_t id)
  {
          int retval;
  
          /* Report and clear outstanding errors */
          retval = filp->f_error;
          if (retval)
                  filp->f_error = 0;
  
          if (!file_count(filp)) {
                  printk(KERN_ERR "VFS: Close: file count is 0\n");
                  return retval;
          }
  
          if (filp->f_op && filp->f_op->flush) {
                  int err = filp->f_op->flush(filp);
                  if (!retval)
                          retval = err;
          }
  
          dnotify_flush(filp, id);
          locks_remove_posix(filp, id);
          fput(filp);
          return retval;
  }

So it looks like either:

(1) the filp->f_op or f_op->flush field(s) sent it off into the weeds.  
    If there were a flush handler that was in module space, it's certainly 
    gone by the time the filp_close() was called.
(2) the flush function was called, and then it called a second function,
    the second function over-wrote the return addres back to the flush
    function with the ecd791bc, and the kernel died executing whatever
    was there.

If #2 occurred, the text return address back to the flush function would
have been wiped out, so the next one if found was the filp_close() return
address.

-------------------------------------------------------------------------

Second crash in copy_files():

  Unable to handle kernel paging request at virtual address b07e7294
  printing eip: c01208a2
  
  CPU:    0
  EIP:    0060:[<c01208a2>]    Not tainted VLI
  EFLAGS: 00010282   (2.6.9-22.0.1.ELsmp)
  EIP is at copy_files+0x1c0/0x2b7
  eax: b07e7280   ebx: f88e51b8   ecx: 00000000   edx: 00000433
  esi: ed596094   edi: f89311b4   ebp: e8660b80   esp: f2144f10
  ds: 007b   es: 007b   ss: 0068
  
  Process qmail-inject (pid: 7345, threadinfo=f2144000 task=f3605830)
  
  [<c0120d84>] copy_process+0x38a/0xaba
  [<c01215a0>] do_fork+0x8e/0x175
  [<c0104990>] sys_vfork+0x18/0x1c
  [<c02d0fb7>] syscall_call+0x7/0xb

The return address is after the copy_files() call:

  0xc0120d6a <copy_process+880>:  call   0xc019ef7f <copy_semundo>
  0xc0120d6f <copy_process+885>:  test   %eax,%eax
  0xc0120d71 <copy_process+887>:  mov    %eax,%edi
  0xc0120d73 <copy_process+889>:  jne    0xc01213f8 <copy_process+2558>
  0xc0120d79 <copy_process+895>:  mov    0x8(%esp),%eax
  0xc0120d7d <copy_process+899>:  mov    %ebp,%edx
  0xc0120d7f <copy_process+901>:  call   0xc01206e2 <copy_files>
  0xc0120d84 <copy_process+906>:  test   %eax,%eax
  0xc0120d86 <copy_process+908>:  mov    %eax,%edi
  0xc0120d88 <copy_process+910>:  jne    0xc01213f1 <copy_process+2551>

The EIP of c01208a2 is in copy_files(), where it's doing the lock incl:

  0xc0120895 <copy_files+435>:    test   %edx,%edx
  0xc0120897 <copy_files+437>:    je     0xc01208b6 <copy_files+468>
  0xc0120899 <copy_files+439>:    mov    (%ebx),%eax
  0xc012089b <copy_files+441>:    add    $0x4,%ebx
  0xc012089e <copy_files+444>:    test   %eax,%eax
  0xc01208a0 <copy_files+446>:    je     0xc01208a6 <copy_files+452>
  0xc01208a2 <copy_files+448>:    lock incl 0x14(%eax)
  0xc01208a6 <copy_files+452>:    mov    0x1c(%esp),%edi
  0xc01208aa <copy_files+456>:    mov    %eax,(%edi)
  0xc01208ac <copy_files+458>:    add    $0x4,%edi
  0xc01208bf <copy_files+477>:    call   0xc02cfb9e <_spin_unlock>

which is the get_file() in macro copy_files():

        old_fds = oldf->fd;
        new_fds = newf->fd;

        memcpy(newf->open_fds->fds_bits, oldf->open_fds->fds_bits, open_files/8);
        memcpy(newf->close_on_exec->fds_bits, oldf->close_on_exec->fds_bits,
open_files/8);

        for (i = open_files; i != 0; i--) {
                struct file *f = *old_fds++;
                if (f)
                        get_file(f);
                *new_fds++ = f;
        }
        spin_unlock(&oldf->file_lock);

i.e., get_file() is defined like so:

  #define get_file(x)     atomic_inc(&(x)->f_count)

So, while walking through the open file list of the current process,
it came across a bogus file structure pointer of b07e7280.

-------------------------------------------------------------------------

Third crash in filp_close() again:

  Unable to handle kernel paging request at virtual address aaedfca0
  printing eip: c0159409
  
  EIP:    0060:[<c0159409>]    Not tainted VLI
  EFLAGS: 00010282   (2.6.9-22.0.1.ELsmp) 
  EIP is at filp_close+0x5/0x5f   
  eax: aaedfc80   ebx: e12c9b00   ecx: aaedfc80   edx: e12c9b00
  esi: 07ffffff   edi: e12c9b00   ebp: 00000021   esp: c9162f74
  ds: 007b   es: 007b   ss: 0068
  Process qmail-queue (pid: 25011, threadinfo=c9162000 task=e18ee2b0)
  
  [<c01234e7>] put_files_struct+0x57/0xc0
  [<c01240fd>] do_exit+0x227/0x3de
  [<c012439f>] sys_exit_group+0x0/0xd
  [<c02d0fb7>] syscall_call+0x7/0xb
  
put_files_struct() contains the inline function close_files(), which
repeatedly calls filp_close():

  static inline void close_files(struct files_struct * files)
  {
          int i, j;
  
          j = 0;
          for (;;) {
                  unsigned long set;
                  i = j * __NFDBITS;
                  if (i >= files->max_fdset || i >= files->max_fds)
                          break;
                  set = files->open_fds->fds_bits[j++];
                  while (set) {
                          if (set & 1) {
                                  struct file * file = xchg(&files->fd[i], NULL);
                                  if (file) {
                                          filp_close(file, files);
                                          cond_resched();
                                  }
                          }
                          i++;
                          set >>= 1;
                  }
          }
  }

...which again comes across a bogus file structure pointer aaedfc80, 
and passes it to filp_close(), which crashes when referencing it:

  (gdb) disass filp_close
  Dump of assembler code for function filp_close:
  0xc0159404 <filp_close+0>:      push   %edi
  0xc0159405 <filp_close+1>:      mov    %edx,%edi
  0xc0159407 <filp_close+3>:      push   %esi
  0xc0159408 <filp_close+4>:      push   %ebx
  0xc0159409 <filp_close+5>:      mov    0x20(%eax),%esi  
  

-------------------------------------------------------------------------

Fourth crash in filp_close():

  Unable to handle kernel NULL pointer dereference at virtual address 0000002d
  printing eip: c0159434
  
  EIP:    0060:[<c0159434>]    Not tainted VLI
  EFLAGS: 00010202   (2.6.9-22.0.1.ELsmp)
  EIP is at filp_close+0x30/0x5f 
  eax: 00000001   ebx: e0684c80   ecx: e0684c80   edx: f1dfe480
  esi: dead4ead   edi: f1dfe480   ebp: 00000022   esp: f6798f74
  ds: 007b   es: 007b   ss: 0068
  Process qmail-queue (pid: 20762, threadinfo=f6798000 task=e9d95330)
  
  [<c01234e7>] put_files_struct+0x57/0xc0
  [<c01240fd>] do_exit+0x227/0x3de
  [<c012439f>] sys_exit_group+0x0/0xd
  [<c02d0fb7>] syscall_call+0x7/0xb

Here again, put_files_struct() has called filp_close() with an
invalid, yet readable, value of e0684c80, and got as far as the 
filp->f_op->flush dereference, where filp->f_op contains a value 
of 1:

        if (filp->f_op && filp->f_op->flush) {
                int err = filp->f_op->flush(filp);
                if (!retval)
                        retval = err;
        }

-------------------------------------------------------------------------

Fifth crash in generic_fillattr():

  Unable to handle kernel paging request at virtual address 823c2208
  printing eip: c0161611
  
  EIP:    0060:[<c0161611>]    Not tainted VLI
  EFLAGS: 00010246   (2.6.9-22.0.1.ELsmp)
  EIP is at generic_fillattr+0xd/0x99
  eax: 823c2200   ebx: e2768f6c   ecx: dba2dde8   edx: e2768f6c
  esi: 00000000   edi: e2768f6c   ebp: dba2dde8   esp: e2768ee4
  ds: 007b   es: 007b   ss: 0068
  Process find (pid: 5518, threadinfo=e2768000 task=e85550b0)
  
  [<c01616dd>] vfs_getattr+0x40/0x88
  [<c0161784>] vfs_lstat+0x25/0x37
  [<c0161d65>] sys_lstat64+0xf/0x23
  [<c015945d>] filp_close+0x59/0x5f
  [<c02d0fb7>] syscall_call+0x7/0xb

where the "find" process is doing an lstat on an inode.  vfs_gettattr()
calls generic_fillattr() if the inode's i_op doesn't have its own
getattr function:

  int vfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
  {
          struct inode *inode = dentry->d_inode;
          int retval;
  
          retval = security_inode_getattr(mnt, dentry);
          if (retval)
                  return retval;
  
          if (inode->i_op->getattr)
                  return inode->i_op->getattr(mnt, dentry, stat);
  
          generic_fillattr(inode, stat);
          if (!stat->blksize) {
                  struct super_block *s = inode->i_sb;
                  unsigned blocks;
                  blocks = (stat->size+s->s_blocksize-1) >> s->s_blocksize_bits;
                  stat->blocks = (s->s_blocksize / 512) * blocks;
                  stat->blksize = s->s_blocksize;
          }
          return 0;
  }
  
generic_fillattr() does just that, filling in the passed-in kstat
structure:

  void generic_fillattr(struct inode *inode, struct kstat *stat)
  {
          stat->dev = inode->i_sb->s_dev;
          stat->ino = inode->i_ino;
          stat->mode = inode->i_mode;
          stat->nlink = inode->i_nlink;
          stat->uid = inode->i_uid;
          stat->gid = inode->i_gid;
          stat->rdev = inode->i_rdev;
          stat->atime = inode->i_atime;
          stat->mtime = inode->i_mtime;
          stat->ctime = inode->i_ctime;
          stat->size = i_size_read(inode);
          stat->blocks = inode->i_blocks;
          stat->blksize = inode->i_blksize;
  }
 
The inode value passed in was dba2dde8, probably bogus, but readable.
But it's i_sb value 823c2200, which when dereferenced as i_sb->s_dev
crashed.

So in this case, the question is where the inode reference came
from, based upon the dentry found by user_path_walk_link():
  
  int vfs_lstat(char __user *name, struct kstat *stat)
  {
          struct nameidata nd;
          int error;
  
          error = user_path_walk_link(name, &nd);
          if (!error) {
                  error = vfs_getattr(nd.mnt, nd.dentry, stat);
                  path_release(&nd);
          }
          return error;
  }

-------------------------------------------------------------------------

In all of 5 cases, there's not nearly enough information to
determine what led up to the corrupt file-related structures,
i.e., the system "context" at the time of the crashes.

In order to continue debugging this, we're going to need a netdump
or diskdump, whichever you are capable of setting up.  With that
extra data, hopefully we can gather enough further evidence to figure
out what happened.

Comment 10 Jiri Pallich 2012-06-20 16:09:33 UTC

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please See https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

Note You need to log in before you can comment on or make changes to this bug.