Bug 90938 - mysterious crashes with any kernel

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2003-05-15 16:29 UTC by Jure Pecar
Modified:	2007-11-30 22:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2003-05-22 22:06:55 UTC
Target Upstream Version:
Embargoed:


Description Jure Pecar 2003-05-15 16:29:19 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021127

Description of problem:
I first reported weird crashes on redhat lists and lkml
(http://www.ussg.iu.edu/hypermail/linux/kernel/0304.0/1773.html,
http://www.redhat.com/archives/valhalla-list/2003-April/msg00192.html).

Since then I reinstalled all machines with Advanced Server, but the problem is
still there.

What I know so far:

If the box is unplugged from network, it works perfectly.

When it is on the network, it can work for days without problems.

Oopsen appear randomly, either in 'showers', leaving the box unusable for
hours, or one here and there minutes apart.

Oopsen have a weak correlation with the daylight hours: when they started a month
ago, they first appeared at the peak hours. I observed oopsen going on from 11pm
to 5am at least three times. For the last few days, they mostly appear about
every 45min during the day.

Oopsen are triggered by random processes ... from postfix smtpd to a simple ls.
No pattern here.

Different kernels behave a bit differently: 2.4.19-pre10aa4, which I was running on
redhat 6.2, returned a segfault for every process I attempted to start but
kept on going. 2.4.18-26 was similar. But 2.4.9-e.16 and later freeze
immediately on the first oops, as does for example 2.4.21-rc2.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.16, 17, 20, standard marcelo kernel

How reproducible:
Didn't try

Steps to Reproduce:
No idea (still).

Additional info:

I patched 2.4.21-rc2 with lkcd and have the dumps available:

http://nerv.eu.org/wwwmemdump0.tar.gz
http://nerv.eu.org/wwwmemdump1.tar.gz
http://nerv.eu.org/wwwmemdump2.tar.gz

As the box in question has 4GB of RAM, each of these files is ~1GB.

Examining them with lcrash shows one common point. There is a process spawned
from its parent that hasn't yet switched its uid and has zero size. So, with my
limited knowledge of kernel internals, it looks like something went seriously
wrong at the fork() stage.

Since I suspect that this is remotely triggered, I also have some tcpdumps
available:
http://nerv.eu.org/dump.tar.bz2

As for security, I confirmed that there are no changed files with tripwire and aide.

Comment 1 Jure Pecar 2003-05-22 01:34:34 UTC

I just sent some more info on ext3-users and postfix-users mailing lists.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
*pde = 2bc68001
Oops: 0000
CPU:    3
EIP:    0010:[<f581d7db>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010286
eax: bffec974   ebx: ebc66000   ecx: f5820000   edx: 00000000
esi: c01073c3   edi: 0000000b   ebp: ebc67fb8   esp: ebc67f80
ds: 0018   es: 0018   ss: 0018
Process smtp (pid: 1515, stackpage=ebc67000)
Stack: ebc66000 c01073c3 0000000b ed229000 c01519de ed229000 ed229000 0000000b 
       00000000 bffec974 0000000b 00000000 00000a3a 00000020 bffec0b8 f581d9c4 
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
Call Trace: [<c01073c3>] system_call [kernel] 0x33 
[<c01519de>] getname [kernel] 0x5e 
Code: 8b 42 04 83 f8 ff 0f 84 69 01 00 00 83 f8 fc 77 07 c7 42 04 

>>EIP; f581d7db <_end+353fd603/383dfe88>   <=====
Trace; c01073c3 <system_call+33/38>
Trace; c01519de <getname+5e/a0>
Code;  f581d7db <_end+353fd603/383dfe88>
00000000 <_EIP>:
Code;  f581d7db <_end+353fd603/383dfe88>   <=====
   0:   8b 42 04                  mov    0x4(%edx),%eax   <=====
Code;  f581d7de <_end+353fd606/383dfe88>
   3:   83 f8 ff                  cmp    $0xffffffff,%eax
Code;  f581d7e1 <_end+353fd609/383dfe88>
   6:   0f 84 69 01 00 00         je     175 <_EIP+0x175> f581d950 <_end+353fd778/383dfe88>
Code;  f581d7e7 <_end+353fd60f/383dfe88>
   c:   83 f8 fc                  cmp    $0xfffffffc,%eax
Code;  f581d7ea <_end+353fd612/383dfe88>
   f:   77 07                     ja     18 <_EIP+0x18> f581d7f3 <_end+353fd61b/383dfe88>
Code;  f581d7ec <_end+353fd614/383dfe88>
  11:   c7 42 04 00 00 00 00      movl   $0x0,0x4(%edx)
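
The first decoded instruction already accounts for the fault address in the oops header. A minimal Python sketch, assuming standard 32-bit x86 ModRM encoding and handling only the base-register-plus-8-bit-displacement form seen here; the helper name `effective_address` is made up for illustration:

```python
# Register numbering used by the ModRM byte on 32-bit x86.
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def effective_address(code, regs):
    """Decode `mov disp8(base),reg` (opcode 0x8b, mod=01, no SIB).

    Returns (destination register, base register, displacement,
    address that the instruction reads).
    """
    opcode, modrm, disp = code[0], code[1], code[2]
    if opcode != 0x8B:
        raise ValueError("only MOV r32, r/m32 is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only base+disp8 without SIB is handled")
    if disp >= 0x80:          # disp8 is sign-extended
        disp -= 0x100
    base = REGS[rm]
    addr = (regs[base] + disp) & 0xFFFFFFFF
    return REGS[reg], base, disp, addr

# First three bytes of the Code: line, and edx from the register dump.
code = bytes.fromhex("8b4204")
dest, base, disp, addr = effective_address(code, {"edx": 0x00000000})
print(f"mov {disp:#x}(%{base}),%{dest} reads address {addr:08x}")
```

With edx at zero the load touches address 00000004, which is the address reported as the NULL pointer dereference.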


With oops like this (on 2.4.9-e.17enterprise) the code line decodes into:

0x80494b8 <oops>:	cmp    %ah,0x42(%edx)
0x80494bb <oops+3>:	add    $0x83,%al
0x80494bd <oops+5>:	clc    
0x80494be <oops+6>:	decl   (%edi)
0x80494c0 <oops+8>:	test   %ch,0x1(%ecx)
0x80494c3 <oops+11>:	add    %al,(%eax)
0x80494c5 <oops+13>:	cmp    $0xfffffffc,%eax
0x80494c8 <oops+16>:	ja     0x80494d1 <force_to_data+1>
0x80494ca <oops+18>:	movl   $0x0,0x4(%edx)
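
This listing disagrees with the ksymoops decode of the same bytes, and its first few instructions look like a misaligned decode. One possible explanation, which is an assumption and not something the report confirms, is that the leading `8b` went into the test program as two ASCII characters instead of one byte: `cmp %ah,0x42(%edx)` encodes as `38 62 42`, and 0x38 and 0x62 are the characters '8' and 'b'. A short Python sketch of converting a `Code:` line to raw bytes, with the hypothetical mistake shown next to it:

```python
code_line = "8b 42 04 83 f8 ff 0f 84 69 01 00 00 83 f8 fc 77 07 c7 42 04"

def show(data):
    """Format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)

# Correct conversion: every pair of hex digits becomes one byte.
raw = bytes.fromhex(code_line)

# Hypothetical mistake: the first token stays as text, the rest is converted.
first, rest = code_line.split(" ", 1)
mangled = first.encode("ascii") + bytes.fromhex(rest)

print("correct:", show(raw[:3]))
print("mangled:", show(mangled[:4]))
```

The mangled stream is one byte longer, which would also explain why the listing only lines up with the ksymoops decode again at `cmp $0xfffffffc,%eax`.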

And gdb says this for getname and system_call:

Dump of assembler code for function getname:
0xc0151980 <getname>:	push   %ebp
0xc0151981 <getname+1>:	mov    0xc03d42f8,%edx
0xc0151987 <getname+7>:	push   %edi
0xc0151988 <getname+8>:	push   %esi
0xc0151989 <getname+9>:	mov    $0xfffffff4,%esi
0xc015198e <getname+14>:	push   %ebx
0xc015198f <getname+15>:	mov    0x14(%esp,1),%ebp
0xc0151993 <getname+19>:	push   $0xf0
0xc0151998 <getname+24>:	push   %edx
0xc0151999 <getname+25>:	call   0xc0138f90 <kmem_cache_alloc>
0xc015199e <getname+30>:	pop    %ebx
0xc015199f <getname+31>:	mov    %eax,%edi
0xc01519a1 <getname+33>:	test   %edi,%edi
0xc01519a3 <getname+35>:	pop    %eax
0xc01519a4 <getname+36>:	je     0xc0151a12 <getname+146>
0xc01519a6 <getname+38>:	cmp    $0xbfffffff,%ebp
0xc01519ac <getname+44>:	mov    $0x1000,%ebx
0xc01519b1 <getname+49>:	jbe    0xc01519c7 <getname+71>
0xc01519b3 <getname+51>:	mov    $0xffffe000,%eax
0xc01519b8 <getname+56>:	and    %esp,%eax
0xc01519ba <getname+58>:	cmpl   $0xffffffff,0xc(%eax)
0xc01519be <getname+62>:	je     0xc01519d6 <getname+86>
0xc01519c0 <getname+64>:	mov    $0xfffffff2,%ebx
0xc01519c5 <getname+69>:	jmp    0xc01519fb <getname+123>
0xc01519c7 <getname+71>:	mov    $0xc0000000,%eax
0xc01519cc <getname+76>:	sub    %ebp,%eax
0xc01519ce <getname+78>:	cmp    $0xfff,%eax
0xc01519d3 <getname+83>:	cmovbe %eax,%ebx
0xc01519d6 <getname+86>:	push   %ebx
0xc01519d7 <getname+87>:	push   %ebp
0xc01519d8 <getname+88>:	push   %edi
0xc01519d9 <getname+89>:	call   0xc022b520 <strncpy_from_user>
0xc01519de <getname+94>:	add    $0xc,%esp
0xc01519e1 <getname+97>:	test   %eax,%eax
0xc01519e3 <getname+99>:	jle    0xc01519f1 <getname+113>
0xc01519e5 <getname+101>:	cmp    %ebx,%eax
0xc01519e7 <getname+103>:	sbb    %ebx,%ebx
0xc01519e9 <getname+105>:	and    $0x24,%ebx
0xc01519ec <getname+108>:	sub    $0x24,%ebx
0xc01519ef <getname+111>:	jmp    0xc01519fb <getname+123>
0xc01519f1 <getname+113>:	mov    $0xfffffffe,%ebx
0xc01519f6 <getname+118>:	test   %eax,%eax
0xc01519f8 <getname+120>:	cmovne %eax,%ebx
0xc01519fb <getname+123>:	test   %ebx,%ebx
0xc01519fd <getname+125>:	mov    %edi,%esi
0xc01519ff <getname+127>:	jns    0xc0151a12 <getname+146>
0xc0151a01 <getname+129>:	push   %esi
0xc0151a02 <getname+130>:	mov    0xc03d42f8,%ecx
0xc0151a08 <getname+136>:	mov    %ebx,%esi
0xc0151a0a <getname+138>:	push   %ecx
0xc0151a0b <getname+139>:	call   0xc0139200 <kmem_cache_free>
0xc0151a10 <getname+144>:	pop    %eax
0xc0151a11 <getname+145>:	pop    %edx
0xc0151a12 <getname+146>:	pop    %ebx
0xc0151a13 <getname+147>:	mov    %esi,%eax
0xc0151a15 <getname+149>:	pop    %esi
0xc0151a16 <getname+150>:	pop    %edi
0xc0151a17 <getname+151>:	pop    %ebp
0xc0151a18 <getname+152>:	ret    
0xc0151a19 <getname+153>:	lea    0x0(%esi,1),%esi
End of assembler dump.

Dump of assembler code for function system_call:
0xc0107390 <system_call>:	push   %eax
0xc0107391 <system_call+1>:	cld    
0xc0107392 <system_call+2>:	push   %es
0xc0107393 <system_call+3>:	push   %ds
0xc0107394 <system_call+4>:	push   %eax
0xc0107395 <system_call+5>:	push   %ebp
0xc0107396 <system_call+6>:	push   %edi
0xc0107397 <system_call+7>:	push   %esi
0xc0107398 <system_call+8>:	push   %edx
0xc0107399 <system_call+9>:	push   %ecx
0xc010739a <system_call+10>:	push   %ebx
0xc010739b <system_call+11>:	mov    $0x18,%edx
0xc01073a0 <system_call+16>:	mov    %edx,%ds
0xc01073a2 <system_call+18>:	mov    %edx,%es
0xc01073a4 <system_call+20>:	mov    $0xffffe000,%ebx
0xc01073a9 <system_call+25>:	and    %esp,%ebx
0xc01073ab <system_call+27>:	cmp    $0x100,%eax
0xc01073b0 <system_call+32>:	jae    0xc0107445 <badsys>
0xc01073b6 <system_call+38>:	testb  $0x2,0x18(%ebx)
0xc01073ba <system_call+42>:	jne    0xc0107418 <tracesys>
0xc01073bc <system_call+44>:	call   *0xc02f43e4(,%eax,4)
0xc01073c3 <system_call+51>:	mov    %eax,0x18(%esp,1)
0xc01073c7 <system_call+55>:	nop    
End of assembler dump.

Going by oops-tracing.txt, this should be enough for someone who feels at home
in kernel internals to figure out the problem. My asm knowledge is nonexistent ...

Comment 2 Stephen Tweedie 2003-05-22 09:45:56 UTC

I can't see any evidence of a kernel bug here.  My first impression is that your
box[es] appear to be compromised with a rootkit, as near as I can tell.  The asm
that oopsed is garbage; the return address on the stack is right after the
indirect call in system_call().  So somebody has patched the system call
table to point to a module, but the module is bogus.
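
Both observations can be checked with a little address arithmetic. A rough Python sketch using only addresses from the oops and the gdb output in comment 1; the 7-byte length of the indirect call is inferred from the gap between `system_call+44` and `system_call+51`, `_end` is derived from the ksymoops `_end+353fd603` annotation, and the start of the kernel image is an assumed conventional i386 value:

```python
# Addresses copied from the oops and the gdb dump of system_call.
EIP = 0xF581D7DB          # where the oops happened
RETURN_ADDR = 0xC01073C3  # first Call Trace entry, system_call+0x33
CALL_SITE = 0xC01073BC    # call *0xc02f43e4(,%eax,4)
CALL_LEN = 7              # gap between system_call+44 and system_call+51

# ksymoops printed the EIP as _end+353fd603, so _end can be recovered.
KERNEL_END = EIP - 0x353FD603
# Assumption: the usual i386 start of the static kernel image.
KERNEL_START = 0xC0100000

def in_static_kernel(addr):
    """True if addr lies inside the statically linked kernel image."""
    return KERNEL_START <= addr < KERNEL_END

print(f"_end = {KERNEL_END:08x}")
print("return address follows the indirect call:",
      RETURN_ADDR == CALL_SITE + CALL_LEN)
print("EIP inside static kernel image:", in_static_kernel(EIP))
```

An EIP beyond `_end` means the faulting code is not part of the kernel image itself, which fits a system call handler living in dynamically allocated module memory.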

Either you are loading a buggy (and very badly behaved) module deliberately, or
there's a rootkit on the box.

Oh, and tripwire on its own is useful, but it isn't enough to completely verify
your system --- most rootkits have the ability to hide the files that they
modify from user-space programs.  You really need to verify the box from a
standalone rescue CD boot to eliminate that possibility.
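
A minimal sketch of that kind of offline check in Python, assuming the suspect root filesystem is mounted read-only from a rescue boot and that a manifest of known-good SHA-256 hashes is available from a trusted source. The mount point, the manifest format and the function names are hypothetical; this is not how tripwire or aide work internally:

```python
import hashlib
import os

def sha256_of(path, chunk=1 << 16):
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify(root, manifest):
    """Compare files under root against expected digests.

    manifest maps paths relative to root to hex SHA-256 digests.
    Returns a sorted list of relative paths that are missing or differ.
    """
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_of(full) != expected:
            bad.append(rel)
    return sorted(bad)

# Hypothetical usage from the rescue environment:
#   verify("/mnt/sysimage", {"bin/ls": "<known-good digest>"})
```

The point of running it from the rescue boot is that the hashing code then reads the disk through a clean kernel, so a rootkit in the installed kernel has no chance to hide modified files.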

Comment 3 Jure Pecar 2003-05-22 22:06:55 UTC

Ah, booting from the CD was the tip I was missing ... it turned out to be a
version of suckit.

Many thanks for the help.
