From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021127

Description of problem:
I first reported weird crashes on the Red Hat lists and lkml (http://www.ussg.iu.edu/hypermail/linux/kernel/0304.0/1773.html, http://www.redhat.com/archives/valhalla-list/2003-April/msg00192.html). Since then I have reinstalled all machines with Advanced Server, but the problem is still there.

What I know so far:

If the box is unplugged from the network, it works perfectly. When it is on the network, it can work for days without problems. Oopsen appear randomly, either in 'showers' that leave the box unusable for hours, or one here and there, minutes apart.

Oopsen have a weak correlation with daylight hours. When they started a month ago, they first appeared at peak hours. I have observed oopsen going on from 11pm to 5am at least three times. For the last few days, they have mostly appeared roughly every 45 minutes during the day.

Oopsen are triggered by random processes, from postfix smtpd to a simple ls. No pattern here.

Different kernels behave a bit differently: 2.4.19-pre10aa4, which I was running on Red Hat 6.2, returned a segfault for every process I attempted to start but kept on going. 2.4.18-26 was similar. But 2.4.9-e.16 and later freeze immediately on the first oops, as does, for example, 2.4.21-rc2.

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.16, 17, 20, standard marcelo kernel

How reproducible:
Didn't try

Steps to Reproduce:
No idea (still).

Additional info:
I patched 2.4.21-rc2 with lkcd and have the dumps available:
http://nerv.eu.org/wwwmemdump0.tar.gz
http://nerv.eu.org/wwwmemdump1.tar.gz
http://nerv.eu.org/wwwmemdump2.tar.gz

As the box in question has 4 GB of RAM, each of these files is ~1 GB. Examining them with lcrash shows one common point: there is a process spawned from its parent that hasn't yet switched its uid and has zero size. So, with my limited knowledge of kernel internals, something went seriously wrong at the fork() stage.

Since I suspect that this is remotely triggered, I also have some tcpdumps available:
http://nerv.eu.org/dump.tar.bz2

As for security, I confirmed with tripwire and aide that there are no changed files.
I just sent some more info to the ext3-users and postfix-users mailing lists.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
*pde = 2bc68001
Oops: 0000
CPU: 3
EIP: 0010:[<f581d7db>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010286
eax: bffec974 ebx: ebc66000 ecx: f5820000 edx: 00000000
esi: c01073c3 edi: 0000000b ebp: ebc67fb8 esp: ebc67f80
ds: 0018 es: 0018 ss: 0018
Process smtp (pid: 1515, stackpage=ebc67000)
Stack: ebc66000 c01073c3 0000000b ed229000 c01519de ed229000 ed229000 0000000b
       00000000 bffec974 0000000b 00000000 00000a3a 00000020 bffec0b8 f581d9c4
       00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call Trace: [<c01073c3>] system_call [kernel] 0x33
[<c01519de>] getname [kernel] 0x5e
Code: 8b 42 04 83 f8 ff 0f 84 69 01 00 00 83 f8 fc 77 07 c7 42 04

>>EIP; f581d7db <_end+353fd603/383dfe88> <=====
Trace; c01073c3 <system_call+33/38>
Trace; c01519de <getname+5e/a0>
Code; f581d7db <_end+353fd603/383dfe88>
00000000 <_EIP>:
Code; f581d7db <_end+353fd603/383dfe88> <=====
   0: 8b 42 04 mov 0x4(%edx),%eax <=====
Code; f581d7de <_end+353fd606/383dfe88>
   3: 83 f8 ff cmp $0xffffffff,%eax
Code; f581d7e1 <_end+353fd609/383dfe88>
   6: 0f 84 69 01 00 00 je 175 <_EIP+0x175> f581d950 <_end+353fd778/383dfe88>
Code; f581d7e7 <_end+353fd60f/383dfe88>
   c: 83 f8 fc cmp $0xfffffffc,%eax
Code; f581d7ea <_end+353fd612/383dfe88>
   f: 77 07 ja 18 <_EIP+0x18> f581d7f3 <_end+353fd61b/383dfe88>
Code; f581d7ec <_end+353fd614/383dfe88>
  11: c7 42 04 00 00 00 00 movl $0x0,0x4(%edx)

With an oops like this (on 2.4.9-e.17enterprise), the code line decodes into:

0x80494b8 <oops>: cmp %ah,0x42(%edx)
0x80494bb <oops+3>: add $0x83,%al
0x80494bd <oops+5>: clc
0x80494be <oops+6>: decl (%edi)
0x80494c0 <oops+8>: test %ch,0x1(%ecx)
0x80494c3 <oops+11>: add %al,(%eax)
0x80494c5 <oops+13>: cmp $0xfffffffc,%eax
0x80494c8 <oops+16>: ja 0x80494d1 <force_to_data+1>
0x80494ca <oops+18>: movl $0x0,0x4(%edx)

And gdb says this for getname and system_call:

Dump of assembler code for function getname:
0xc0151980 <getname>: push %ebp
0xc0151981 <getname+1>: mov 0xc03d42f8,%edx
0xc0151987 <getname+7>: push %edi
0xc0151988 <getname+8>: push %esi
0xc0151989 <getname+9>: mov $0xfffffff4,%esi
0xc015198e <getname+14>: push %ebx
0xc015198f <getname+15>: mov 0x14(%esp,1),%ebp
0xc0151993 <getname+19>: push $0xf0
0xc0151998 <getname+24>: push %edx
0xc0151999 <getname+25>: call 0xc0138f90 <kmem_cache_alloc>
0xc015199e <getname+30>: pop %ebx
0xc015199f <getname+31>: mov %eax,%edi
0xc01519a1 <getname+33>: test %edi,%edi
0xc01519a3 <getname+35>: pop %eax
0xc01519a4 <getname+36>: je 0xc0151a12 <getname+146>
0xc01519a6 <getname+38>: cmp $0xbfffffff,%ebp
0xc01519ac <getname+44>: mov $0x1000,%ebx
0xc01519b1 <getname+49>: jbe 0xc01519c7 <getname+71>
0xc01519b3 <getname+51>: mov $0xffffe000,%eax
0xc01519b8 <getname+56>: and %esp,%eax
0xc01519ba <getname+58>: cmpl $0xffffffff,0xc(%eax)
0xc01519be <getname+62>: je 0xc01519d6 <getname+86>
0xc01519c0 <getname+64>: mov $0xfffffff2,%ebx
0xc01519c5 <getname+69>: jmp 0xc01519fb <getname+123>
0xc01519c7 <getname+71>: mov $0xc0000000,%eax
0xc01519cc <getname+76>: sub %ebp,%eax
0xc01519ce <getname+78>: cmp $0xfff,%eax
0xc01519d3 <getname+83>: cmovbe %eax,%ebx
0xc01519d6 <getname+86>: push %ebx
0xc01519d7 <getname+87>: push %ebp
0xc01519d8 <getname+88>: push %edi
0xc01519d9 <getname+89>: call 0xc022b520 <strncpy_from_user>
0xc01519de <getname+94>: add $0xc,%esp
0xc01519e1 <getname+97>: test %eax,%eax
0xc01519e3 <getname+99>: jle 0xc01519f1 <getname+113>
0xc01519e5 <getname+101>: cmp %ebx,%eax
0xc01519e7 <getname+103>: sbb %ebx,%ebx
0xc01519e9 <getname+105>: and $0x24,%ebx
0xc01519ec <getname+108>: sub $0x24,%ebx
0xc01519ef <getname+111>: jmp 0xc01519fb <getname+123>
0xc01519f1 <getname+113>: mov $0xfffffffe,%ebx
0xc01519f6 <getname+118>: test %eax,%eax
0xc01519f8 <getname+120>: cmovne %eax,%ebx
0xc01519fb <getname+123>: test %ebx,%ebx
0xc01519fd <getname+125>: mov %edi,%esi
0xc01519ff <getname+127>: jns 0xc0151a12 <getname+146>
0xc0151a01 <getname+129>: push %esi
0xc0151a02 <getname+130>: mov 0xc03d42f8,%ecx
0xc0151a08 <getname+136>: mov %ebx,%esi
0xc0151a0a <getname+138>: push %ecx
0xc0151a0b <getname+139>: call 0xc0139200 <kmem_cache_free>
0xc0151a10 <getname+144>: pop %eax
0xc0151a11 <getname+145>: pop %edx
0xc0151a12 <getname+146>: pop %ebx
0xc0151a13 <getname+147>: mov %esi,%eax
0xc0151a15 <getname+149>: pop %esi
0xc0151a16 <getname+150>: pop %edi
0xc0151a17 <getname+151>: pop %ebp
0xc0151a18 <getname+152>: ret
0xc0151a19 <getname+153>: lea 0x0(%esi,1),%esi
End of assembler dump.

Dump of assembler code for function system_call:
0xc0107390 <system_call>: push %eax
0xc0107391 <system_call+1>: cld
0xc0107392 <system_call+2>: push %es
0xc0107393 <system_call+3>: push %ds
0xc0107394 <system_call+4>: push %eax
0xc0107395 <system_call+5>: push %ebp
0xc0107396 <system_call+6>: push %edi
0xc0107397 <system_call+7>: push %esi
0xc0107398 <system_call+8>: push %edx
0xc0107399 <system_call+9>: push %ecx
0xc010739a <system_call+10>: push %ebx
0xc010739b <system_call+11>: mov $0x18,%edx
0xc01073a0 <system_call+16>: mov %edx,%ds
0xc01073a2 <system_call+18>: mov %edx,%es
0xc01073a4 <system_call+20>: mov $0xffffe000,%ebx
0xc01073a9 <system_call+25>: and %esp,%ebx
0xc01073ab <system_call+27>: cmp $0x100,%eax
0xc01073b0 <system_call+32>: jae 0xc0107445 <badsys>
0xc01073b6 <system_call+38>: testb $0x2,0x18(%ebx)
0xc01073ba <system_call+42>: jne 0xc0107418 <tracesys>
0xc01073bc <system_call+44>: call *0xc02f43e4(,%eax,4)
0xc01073c3 <system_call+51>: mov %eax,0x18(%esp,1)
0xc01073c7 <system_call+55>: nop
End of assembler dump.

According to oops-tracing.txt, this should be enough for someone who feels at home in kernel internals to figure out the problem. My asm knowledge is nonexistent...
I can't see any evidence of a kernel bug here. My first impression is that your box[es] appear to be compromised with a rootkit, as near as I can tell. The asm that oopsed is garbage; the return address on the stack is right after the indirection call in system_call(). So somebody has patched the system call table to point to a module, but the module is bogus. Either you are loading a buggy (and very badly behaved) module deliberately, or there's a rootkit on the box. Oh, and tripwire on its own is useful, but it isn't enough to completely verify your system --- most rootkits have the ability to hide the files that they modify from user-space programs. You really need to verify the box from a standalone rescue CD boot to eliminate that possibility.
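The address arithmetic behind that reading can be checked by hand. The following is a minimal Python sketch, not part of any tool: every constant is copied from the ksymoops output and the gdb disassembly earlier in this report, while the function and variable names are made up for illustration. It recovers `_end` from the `_end+353fd603` annotation, confirms that the faulting EIP lies past the static kernel image, and confirms that the first call-trace entry is the instruction right after the indirect call in system_call.

```python
# Sanity-check the addresses from the oops. Constants come from the
# ksymoops and gdb output in this report; all names are illustrative.

EIP = 0xF581D7DB                   # faulting instruction pointer
EIP_OFFSET_FROM_END = 0x353FD603   # ksymoops annotation: <_end+353fd603/...>
RETURN_ADDR = 0xC01073C3           # first call-trace entry
SYSTEM_CALL = 0xC0107390           # start of system_call per gdb
INDIRECT_CALL = 0xC01073BC         # call *0xc02f43e4(,%eax,4)
AFTER_INDIRECT_CALL = 0xC01073C3   # next instruction in the gdb listing


def kernel_image_end(eip: int, offset: int) -> int:
    """Recover the address of _end from ksymoops' '_end+offset' form."""
    return eip - offset


def is_outside_static_kernel(addr: int, end: int) -> bool:
    """True if addr lies past _end, i.e. outside the static kernel image."""
    return addr > end


def is_return_from_indirect_call(ret: int, after_call: int) -> bool:
    """True if ret is the instruction immediately following the call."""
    return ret == after_call


if __name__ == "__main__":
    end = kernel_image_end(EIP, EIP_OFFSET_FROM_END)
    print(f"_end                      = {end:#010x}")
    print(f"EIP past _end             = {is_outside_static_kernel(EIP, end)}")
    print(f"return addr - system_call = {RETURN_ADDR - SYSTEM_CALL:#x}")
    print(f"call instruction length   = {AFTER_INDIRECT_CALL - INDIRECT_CALL}")
    print("returns from table call   =",
          is_return_from_indirect_call(RETURN_ADDR, AFTER_INDIRECT_CALL))
```

An EIP past `_end` only shows that the code ran outside the static kernel image, which is also where legitimately loaded modules live; by itself it does not distinguish a buggy module from a rootkit.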
Ah, booting from the CD was the tip I was missing... it turned out to be a version of the suckit rootkit. Many thanks for the help.