Bug 20005
Summary: | kernel NULL pointer dereference with slocate-2.1-2 | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | tom |
Component: | kernel | Assignee: | Michael K. Johnson <johnsonm> |
Status: | CLOSED NOTABUG | QA Contact: | Brock Organ <borgan> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 6.2 | CC: | dhudes, stephen, trev |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | i386 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2002-12-15 03:02:33 UTC | Type: | --- |
Description
tom
2000-10-29 19:25:26 UTC
That seems similar to a number of crashes I'm getting, on a regular basis:

Oct 30 04:02:16 kernel: Unable to handle kernel paging request at virtual address 524a4e7e
Oct 30 04:02:16 kernel: current->tss.cr3 = 01303000, %cr3 = 01303000
Oct 30 04:02:16 kernel: *pde = 00000000
Oct 30 04:02:16 kernel: Oops: 0000
Oct 30 04:02:16 kernel: CPU: 0
Oct 30 04:02:16 kernel: EIP: 0010:[<524a4e7e>]
Oct 30 04:02:16 kernel: EFLAGS: 00010203
Oct 30 04:02:16 kernel: eax: c1879288 ebx: c0308000 ecx: 00000003 edx: c28ae248
Oct 30 04:02:16 kernel: esi: c2bbf1c0 edi: c2ef9060 ebp: 00000003 esp: c0309f1c
Oct 30 04:02:16 kernel: ds: 0018 es: 0018 ss: 0018
Oct 30 04:02:16 kernel: Process slocate (pid: 4341, process nr: 45, stackpage=c0309000)
Oct 30 04:02:16 kernel: Stack: c2bbf1c0 c2ef9060 00000003 c2bbf1c0 c2bbf1c0 c1b9b004 c012d243 c2ef9060
Oct 30 04:02:16 kernel: c2bbf1c0 00000003 c2e75c60 ffffffe9 00010801 bffffcc8 c1b9b000 00000004
Oct 30 04:02:16 kernel: 0006e9b5 c012d38a c1b9b000 c2ef9060 00000003 c2e75c60 ffffffe9 08050ac8
Oct 30 04:02:16 kernel: Call Trace: [lookup_dentry+351/488] [open_namei+102/848] [filp_open+68/240] [sys_open+54/148] [system_call+52/56]
Oct 30 04:02:17 kernel: Code: <1>Unable to handle kernel paging request at virtual address 524a4e7e
Oct 30 04:02:17 kernel: current->tss.cr3 = 01303000, %cr3 = 01303000
Oct 30 04:02:17 kernel: *pde = 00000000
Oct 30 04:02:17 kernel: Oops: 0000
Oct 30 04:02:17 kernel: CPU: 0
Oct 30 04:02:17 kernel: EIP: 0010:[show_registers+589/640]
Oct 30 04:02:17 kernel: EFLAGS: 00010046
Oct 30 04:02:17 kernel: eax: 00000000 ebx: 00000000 ecx: 524a4e7e edx: c2610000
Oct 30 04:02:17 kernel: esi: 0000002b edi: c030a000 ebp: c3800000 esp: c0309e5c
Oct 30 04:02:17 kernel: ds: 0018 es: 0018 ss: 0018
Oct 30 04:02:17 kernel: Process slocate (pid: 4341, process nr: 45, stackpage=c0309000)
Oct 30 04:02:17 kernel: Stack: 524a4e7e 6db5a000 c024cf8e c2bbf1c0 c2ef9060 00000003 c1879288 c0308000
Oct 30 04:02:17 kernel: 00000003 c28ae248 524a4e7e 00010203 03000000 c4000000 c010a4e4 c0309ee0
Oct 30 04:02:17 kernel: c01defb8 c01e0a2e 00000000 00000000 c010f758 c01e0a2e c0309ee0 00000000
Oct 30 04:02:17 kernel: Call Trace: [<c4000000>] [die+48/56] [error_table+2572/9948] [error_table+9346/9948] [do_page_fault+700/900] [error_table+9346/9948] [error_code+45/52]
Oct 30 04:02:17 kernel: [do_follow_link+76/132] [lookup_dentry+351/488] [open_namei+102/848] [filp_open+68/240] [sys_open+54/148] [system_call+52/56]
Oct 30 04:02:17 kernel: Code: 8a 04 0b 89 44 24 38 50 68 b0 ef 1d c0 e8 81 9e 00 00 83 c4

What kernel - do you have the errata kernel installed?

In my case, I originally installed RH 5.1, then performed an upgrade to 6.2. This problem has been going on ever since my first install, and the upgrade to 6.2 reduced the frequency, but still hasn't solved the problem. I haven't installed the Errata kernel yet.

Linux version 2.2.14-5.0 (root.redhat.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Tue Mar 7 20:53:41 EST 2000

I (Tom, first bug report) am running the original vmlinuz-2.2.14-5.0 kernel that came with 6.2 (the same as trev's). I am planning to upgrade that machine to the errata kernel with ipmasq PPTP patches (which I'm currently running on another machine). Should I move forward with the upgrade and follow up with whether slocate still fails?

I believe (although I'm not 100% sure) that this problem was solved in the errata kernel.

I've just upgraded to the 2.2.16-3 errata kernel, and updated the utils and, because it complained, the pcmcia RPMs as well. We'll see if I wake up to find any more crashes in about two days' time.

At exactly the same time, when slocate is run... I heard a lot of disk activity, and after a few moments it just stopped. Sure enough when I turn the monitor on, there's an Oops error on the screen.

Linux version 2.2.16-3 (root.redhat.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Mon Jun 19 18:49:25 EDT 2000

Nov 5 04:02:01 syslogd 1.3-3: restart.
Nov 5 04:02:01 syslogd 1.3-3: restart.
Nov 5 04:02:02 syslogd 1.3-3: restart.
Nov 5 04:02:02 syslogd 1.3-3: restart.
Nov 5 04:02:02 syslogd 1.3-3: restart.
Nov 5 04:03:21 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Nov 5 04:03:21 kernel: current->tss.cr3 = 0279f000, %cr3 = 0279f000
Nov 5 04:03:21 kernel: *pde = 00000000
Nov 5 04:03:21 kernel: Oops: 0002
Nov 5 04:03:21 kernel: CPU: 0
Nov 5 04:03:21 kernel: EIP: 0010:[__free_inodes+41/100]
Nov 5 04:03:21 kernel: EFLAGS: 00010246
Nov 5 04:03:21 kernel: eax: 00000000 ebx: c1544778 ecx: c1a1bcb0 edx: 00000000
Nov 5 04:03:21 kernel: esi: 00000000 edi: 00000601 ebp: c0347ed0 esp: c0347e98
Nov 5 04:03:21 kernel: ds: 0018 es: 0018 ss: 0018
Nov 5 04:03:21 kernel: Process slocate (pid: 1959, process nr: 27, stackpage=c0347000)
Nov 5 04:03:21 kernel: Stack: c021b2e4 00000601 c013224e c0347ed0 fffffa00 00000601 00000000 c0258460
Nov 5 04:03:21 kernel: c021b2e4 c0258460 c0391380 c1229540 c122958c 00000002 c04ee888 c28c9108
Nov 5 04:03:21 kernel: c01322aa 00000601 00000000 c0258460 c021b2e4 c0258460 c0258458 c01325f1
Nov 5 04:03:21 kernel: Call Trace: [try_to_free_inodes+210/264] [grow_inodes+30/384] [get_new_inode+173/280] [get_new_inode+185/280] [iget+88/96] [ext2_lookup+84/124] [real_lookup+79/160]
Nov 5 04:03:21 kernel: [lookup_dentry+296/488] [__namei+40/88] [sys_newlstat+14/96] [system_call+52/56]
Nov 5 04:03:21 kernel: Code: 89 50 04 89 02 8d 53 f8 8b 43 f8 8b 4b fc 89 48 04 89 01 89

OK, looks like a kernel bug. Reassigning there.

To possibly help debug this:
- what filesystems (types) do you have mounted
- do they all check OK if you force a check?

All I have is one ext2 filesystem:

Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1             1.8G  1.0G  683M  61% /

And yes, running fsck agrees that everything is fine.

I have encountered similar problems.
I have a number of very large (gigabyte-level) tar files consisting of individually gzip'd files that I am converting to bzip2 (saves megabytes). I created a Perl script to extract each file, then invoke gzip -d then bzip2 -z. Previously I have encountered a situation where, when slocate runs while gzip or bzip2 is running on these huge files, the gzip or bzip2 process hangs and the filesystem (ext2, a separate 2Gb partition for this exercise) is corrupted. The only way to stop the gzip/bzip2 is to reboot -- until then you can't access the filesystem in question (/dev/sda12). Now I have hit the wall earlier by starting early.

extracted flows.20000626_19:56:10-0400.gz 17162135 bytes from flow626PM.tar in 88.035609 seconds
decompressed flows.20000626_19:56:10-0400.gz result is 0x01ran withsignal 1

Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: Unable to handle kernel paging request at virtual address 0100f211

Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: current->tss.cr3 = 07775000, %cr3 = 07775000

Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: *pde = 00000000

Despite gzip seemingly completing ok, the .gz file is still around. Here is the ls -l for the last file processed:

-rw------- 1 dhudes dhudes 25985024 Nov 25 21:34 flows.20000626_19:56:10-0400
-rw------- 1 dhudes dhudes  3682304 Nov 25 21:35 flows.20000626_19:56:10-0400.bz2
-rw-r--r-- 1 dhudes dhudes 17162135 Jun 26 19:56 flows.20000626_19:56:10-0400.gz

I have stopped NFS and nfs locking services. I do have 2 setiathome processes running. This machine is a 128Mb dual PII-333. Samba is running; I'm going to try one more time without it. While Apache is running (and named), see this vmstat report:

   procs                      memory    swap        io    system         cpu
 r  b  w  swpd  free   buff  cache  si  so   bi   bo   in   cs  us  sy  id
 1  0  0   164  4684  82796  17604   0   1   78    2   73   27   0   1  99

(with no setiathome running).
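The conversion step described above (gzip -d followed by bzip2 -z on each extracted file) could also be done as a single streaming pass, which avoids writing the large uncompressed intermediate file to the same partition. A minimal sketch in Python rather than the Perl script mentioned, which is not shown in this report; the function name gz_to_bz2 and the remove_original flag are hypothetical, and this does nothing about the kernel oops itself:

```python
import bz2
import gzip
import os
import shutil


def gz_to_bz2(gz_path, remove_original=False):
    """Recompress a .gz file to .bz2 without an uncompressed temp file.

    gz_to_bz2 and remove_original are illustrative names, not taken
    from the reporter's script.
    """
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz")
    bz2_path = gz_path[:-3] + ".bz2"
    # Stream in 1 MiB chunks so memory use stays flat for huge files.
    with gzip.open(gz_path, "rb") as src, bz2.open(bz2_path, "wb") as dst:
        shutil.copyfileobj(src, dst, 1024 * 1024)
    # Only drop the .gz after the .bz2 has been written and closed.
    if remove_original:
        os.remove(gz_path)
    return bz2_path
```

Keeping the original .gz until the .bz2 is fully written means an interrupted run leaves the source data intact.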
The kernel should not take out the system if it runs out of paging. It should probably terminate the offending process (and log the event of course).

I am also having a similar problem. I get kernel Aiee messages like this every couple of days!

---
svc: unknown version (3)
svc: unknown program 100227 (me 100003)
svc: unknown version (3)
scsi0 channel 0 : resetting for second half of retries.
SCSI bus is being reset for host 0 channel 0.
Unable to handle kernel NULL pointer dereference at virtual address 00000114
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<e00015b9>]
EFLAGS: 00010002
eax: df42a078 ebx: df42a078 ecx: 00000001 edx: 00000092
esi: 00000000 edi: 00000046 ebp: 00000006 esp: c02514c0
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=c0251000)
Stack: df42a000 df3f5a00 00000000 00000001 dfdd1620 c01b8ad4 df3f5a00 00000001
00000001 df3f5a00 00000002 00000027 00000001 c01b86ad df3f5a00 00000001
c020a720 00000000 00000000 df42a078 df3f5a00 c01b824c 00000000 df42a000
Call Trace: [<c01b8ad4>] [<c01b86ad>] [<c020a720>] [<c01b824c>] [<e00022e7>] [<e000007b>] [<e000248c>] [<c01b8010>] [<c01b67c0>] [<c01b4faf>] [<c01b5073>] [<c01b824c>] [<c01b871c>] [<c01b824c>] [<e00022e7>] [<c01b8010>] [<c01b67c0>] [<c01b2af5>] [<c01b2bca>] [<c01b824c>] [<c01bead8>] [<c020b8ed>] [<c01c0578>] [<c01bead8>] [<c01e49dc>] [<c01bf47f>] [<c01e0008>] [<c01bf448>] [<c01bf47f>] [<c0173fdd>] [<c01727df>] [<c01a3f32>] [<c01be7a5>] [<c01beec5>] [<c01bf056>] [<c01b87dc>] [<e0005a07>] [<e0005d03>] [<c010c0b6>] [<c0110f72>] [<c010c227>] [<c010b22c>] [<c0108a4d>] [<c0106000>] [<c0106000>] [<c01001ae>]
Code: c7 86 14 01 00 00 00 00 08 00 56 8b 86 e8 00 00 00 ff d0 83
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapper task - not syncing
----

My kernel is Linux 2.2.16-3 #6 SMP Tue Nov 14 12:35:47 PST 2000 i686 unknown.
df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda5               202220    113001     78779  59% /
/dev/sdb1            286735372  14093112 272642260   5% /export
/dev/sda1               907072    832576     74496  92% /dosc
/dev/sda8              1517920    789896    650916  55% /usr

/sbin/lsmod
Module                  Size  Used by
nfs                    28864   1  (autoclean)
nfsd                  144068   8  (autoclean)
autofs                  9248   6  (autoclean)
lockd                  31656   1  (autoclean) [nfs nfsd]
sunrpc                 54084   1  (autoclean) [nfs nfsd lockd]
eepro100               16564   1  (autoclean)
nls_cp437               3876   2  (autoclean)
vfat                    9404   1  (autoclean)
fat                    31104   1  (autoclean) [vfat]
reiserfs              133160   1  (autoclean)
dpt_i2o                96288   6

free
             total       used       free     shared    buffers     cached
Mem:        515084     294096     220988      19964     170420      94120
-/+ buffers/cache:      29556     485528
Swap:       530064          0     530064

cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 795.501
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr xmm
bogomips : 1585.97

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 3
cpu MHz : 795.501
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr xmm
bogomips : 1589.25

This looks a lot like dodgy memory. slocate definitely exercises the disk subsystem. Try to make a list of oopses and the physical addresses of all the function calls; it's possible that there is a bank of memory that is bad and is causing this. Make sure that you are not overclocking as well; low quality SDRAM will not tolerate higher speeds for long.
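The suggestion above to list the oopses and their addresses can be started from the syslog text. A rough sketch, assuming the 2.2-era oops format quoted in this report; summarize_oopses and the regular expressions are illustrative and not part of any kernel tool. The logs report virtual addresses, so this only tallies what each oops prints and does not map anything to a physical memory bank:

```python
import re
from collections import Counter

# Patterns written against the oops lines quoted in this report (assumption:
# other kernel versions may word these lines differently).
FAULT_RE = re.compile(
    r"Unable to handle kernel (?:paging request|NULL pointer dereference)"
    r" at virtual address ([0-9a-fA-F]+)"
)
EIP_RE = re.compile(r"EIP:\s+[0-9a-fA-F]{4}:\[([^\]]+)\]")


def summarize_oopses(log_text):
    """Count faulting virtual addresses and EIP values seen in log_text."""
    faults = Counter(m.group(1).lower() for m in FAULT_RE.finditer(log_text))
    # EIP is either a raw address like <524a4e7e> or symbol+offset/size.
    eips = Counter(m.group(1).strip("<>") for m in EIP_RE.finditer(log_text))
    return faults, eips


# Lines copied from the oopses earlier in this report.
SAMPLE = """\
Oct 30 04:02:16 kernel: Unable to handle kernel paging request at virtual address 524a4e7e
Oct 30 04:02:16 kernel: EIP: 0010:[<524a4e7e>]
Nov 5 04:03:21 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Nov 5 04:03:21 kernel: EIP: 0010:[__free_inodes+41/100]
"""

if __name__ == "__main__":
    faults, eips = summarize_oopses(SAMPLE)
    for addr, count in faults.most_common():
        print(f"fault address {addr}: {count}")
    for eip, count in eips.most_common():
        print(f"EIP {eip}: {count}")
```

Run over a full /var/log/messages, a tally like this makes it easier to see whether the faults cluster around one code path or scatter across unrelated ones, which is the pattern the bad-memory theory would predict.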
My system originally had 48 MB RAM (2x16 and 2x8 chips, all EDO), and about 10 days ago I upgraded it to 80 MB (2x16, 2x32, all EDO) and it's been up ever since.... so perhaps the RAM chips were the problem... guess Windoze ME doesn't use them enough to care (that's what the 48 is in now).

Update from Tom (original poster): I compiled a custom kernel for the machine that was giving me trouble, and I no longer get those errors when slocate runs. My version came out about 60k smaller than 2.2.14-5.0 (I compiled out SCSI support and targeted an AMD-K6).

The new RAM seems to have fixed my problem. I've been up ever since without any crashes.