Bug 20005

Summary:	kernel NULL pointer dereference with slocate-2.1-2
Product:	[Retired] Red Hat Linux	Reporter:	tom
Component:	kernel	Assignee:	Michael K. Johnson <johnsonm>
Status:	CLOSED NOTABUG	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	6.2	CC:	dhudes, stephen, trev
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2002-12-15 03:02:33 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description tom 2000-10-29 19:25:26 UTC

When slocate.cron runs at night, I get the following in the syslog:

> Oct 28 00:02:29 kernel: Unable to handle kernel NULL pointer dereference at
> virtual address 00000005
> Oct 28 00:02:29 kernel: current->tss.cr3 = 01171000, %%cr3 = 01171000
> Oct 28 00:02:29 kernel: *pde = 00000000
> Oct 28 00:02:29 kernel: Oops: 0000
> Oct 28 00:02:29 kernel: CPU:    0
> Oct 28 00:02:29 kernel: EIP:    0010:[d_lookup+100/220]
> Oct 28 00:02:29 kernel: EFLAGS: 00010203
> Oct 28 00:02:29 kernel: eax: c3c57a80   ebx: ffffffed   ecx: 00000024   edx:
> c3c00000 
> Oct 28 00:02:29 kernel: esi: c169f000   edi: c169f00e   ebp: 00000005   esp:
> c0833f40 
> Oct 28 00:02:29 kernel: ds: 0018   es: 0018   ss: 0018
> Oct 28 00:02:29 kernel: Process slocate (pid: 10836, process nr: 8,
> stackpage=c0833000)
> Oct 28 00:02:29 kernel: Stack: c169f00e 00000000 c3c57a80 c169f000 
501b9a9d
> 0000000e c012cf7c c2fd0ac0
> Oct 28 00:02:29 kernel:        c0833f88 c0833f88 c012d1f7 c2fd0ac0 c0833f88
> 00000000 c169f000 c169f000
> Oct 28 00:02:29 kernel:        0806c204 bffffc38 c169f000 0000000e 501b9a9d
> c012d2f4 c169f000 c2fd0ac0
> Oct 28 00:02:29 kernel: Call Trace: [cached_lookup+16/84]
> [lookup_dentry+275/488] [__namei+40/88] [sys_newlstat+14/96]
> [system_call+52/56]
> Oct 28 00:02:29 kernel: Code: 8b 6d 00 8b 74 24 18 39 73 48 75 58 8b 74 24 24
> 39 73 0c 75  

Filesystem info, if that helps:
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda1              2861360    740804   1975204  27% /
/dev/hdc1             16247612   7855248   7567028  51% /home
/dev/hdd1             16247612   8810804   6611472  57% /data

Comment 1 Trevor Peirce 2000-10-30 16:22:04 UTC

That seems similar to a number of crashes I'm getting, on a regular basis:

Oct 30 04:02:16 kernel: Unable to handle kernel paging request at virtual address 524a4e7e
Oct 30 04:02:16 kernel: current->tss.cr3 = 01303000, %cr3 = 01303000
Oct 30 04:02:16 kernel: *pde = 00000000
Oct 30 04:02:16 kernel: Oops: 0000
Oct 30 04:02:16 kernel: CPU:    0
Oct 30 04:02:16 kernel: EIP:    0010:[<524a4e7e>]
Oct 30 04:02:16 kernel: EFLAGS: 00010203
Oct 30 04:02:16 kernel: eax: c1879288   ebx: c0308000   ecx: 00000003 edx: c28ae248
Oct 30 04:02:16 kernel: esi: c2bbf1c0   edi: c2ef9060   ebp: 00000003 esp: c0309f1c
Oct 30 04:02:16 kernel: ds: 0018   es: 0018   ss: 0018
Oct 30 04:02:16 kernel: Process slocate (pid: 4341, process nr: 45, stackpage=c0309000)
Oct 30 04:02:16 kernel: Stack: c2bbf1c0 c2ef9060 00000003 c2bbf1c0 c2bbf1c0 c1b9b004 c012d243 c2ef9060
Oct 30 04:02:16 kernel:        c2bbf1c0 00000003 c2e75c60 ffffffe9 00010801 bffffcc8 c1b9b000 00000004
Oct 30 04:02:16 kernel:        0006e9b5 c012d38a c1b9b000 c2ef9060 00000003 c2e75c60 ffffffe9 08050ac8
Oct 30 04:02:16 kernel: Call Trace: [lookup_dentry+351/488] [open_namei+102/848] [filp_open+68/240] [sys_open+54/148] [system_call+52/56]
Oct 30 04:02:17 kernel: Code: <1>Unable to handle kernel paging request at virtual address 524a4e7e
Oct 30 04:02:17 kernel: current->tss.cr3 = 01303000, %cr3 = 01303000
Oct 30 04:02:17 kernel: *pde = 00000000
Oct 30 04:02:17 kernel: Oops: 0000
Oct 30 04:02:17 kernel: CPU:    0
Oct 30 04:02:17 kernel: EIP:    0010:[show_registers+589/640]
Oct 30 04:02:17 kernel: EFLAGS: 00010046
Oct 30 04:02:17 kernel: eax: 00000000   ebx: 00000000   ecx: 524a4e7e edx: c2610000
Oct 30 04:02:17 kernel: esi: 0000002b   edi: c030a000   ebp: c3800000 esp: c0309e5c
Oct 30 04:02:17 kernel: ds: 0018   es: 0018   ss: 0018
Oct 30 04:02:17 kernel: Process slocate (pid: 4341, process nr: 45, stackpage=c0309000)
Oct 30 04:02:17 kernel: Stack: 524a4e7e 6db5a000 c024cf8e c2bbf1c0 c2ef9060 00000003 c1879288 c0308000
Oct 30 04:02:17 kernel:        00000003 c28ae248 524a4e7e 00010203 03000000 c4000000 c010a4e4 c0309ee0
Oct 30 04:02:17 kernel:        c01defb8 c01e0a2e 00000000 00000000 c010f758 c01e0a2e c0309ee0 00000000
Oct 30 04:02:17 kernel: Call Trace: [<c4000000>] [die+48/56] [error_table+2572/9948] [error_table+9346/9948] [do_page_fault+700/900] 
[error_table+9346/9948]
 [error_code+45/52]
Oct 30 04:02:17 kernel:        [do_follow_link+76/132] [lookup_dentry+351/488] [open_namei+102/848] [filp_open+68/240] [sys_open+54/148] 
[system_call+52/56]
Oct 30 04:02:17 kernel: Code: 8a 04 0b 89 44 24 38 50 68 b0 ef 1d c0 e8 81 9e 00 00 83 c4

Comment 2 Bill Nottingham 2000-10-30 17:21:41 UTC

What kernel - do you have the errata kernel installed?

Comment 3 Trevor Peirce 2000-10-30 17:34:24 UTC

In my case, I orignally installed RH 5.1, then performed an upgrade to 6.2.  This problem has been going on ever since my first install, and the upgrade to 
6.2 reduced the frequency, but still hasn't solved the problem.  I haven't installed the Errata kernel yet.

Linux version 2.2.14-5.0 (root.redhat.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Tue Mar 7 20:53:41 EST 2000

Comment 4 tom 2000-10-30 17:37:56 UTC

I (Tom, first bug report) am running the original vmlinuz-2.2.14-5.0 kernel that came with 
6.2 (the same as trev's).  I am planning to upgrade that machine to the errata kernel 
with ipmasq PPTP patches (which I'm currently running on another machine).

Should I move forward with the upgrade and followup with whether slocate still fails?

Comment 5 Bill Nottingham 2000-10-31 23:47:15 UTC

I believe (although I'm not 100% sure) that this problem was
solved in the errata kernel.

Comment 6 Trevor Peirce 2000-11-05 05:53:29 UTC

I've just upgraded to the 2.2.16-3 errata kernel, and updated the utils and because it complained, pcmcia RPMs as well.

We'll see if I wake up to find any more crashes in about two days time.

Comment 7 Trevor Peirce 2000-11-05 12:22:09 UTC

At exactly the same time, when slocate is run... I heard a lot of disk activity, and after a few moments it just stopped.  Sure enough when I turn the 
monitor on, there's an Oops error on the screen.

Linux version 2.2.16-3 (root.redhat.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 Mon Jun 19 18:49:25 EDT 2000

Nov  5 04:02:01 syslogd 1.3-3: restart.
Nov  5 04:02:01 syslogd 1.3-3: restart.
Nov  5 04:02:02 syslogd 1.3-3: restart.
Nov  5 04:02:02 syslogd 1.3-3: restart.
Nov  5 04:02:02 syslogd 1.3-3: restart.
Nov  5 04:03:21 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Nov  5 04:03:21 kernel: current->tss.cr3 = 0279f000, %cr3 = 0279f000
Nov  5 04:03:21 kernel: *pde = 00000000
Nov  5 04:03:21 kernel: Oops: 0002
Nov  5 04:03:21 kernel: CPU:    0
Nov  5 04:03:21 kernel: EIP:    0010:[__free_inodes+41/100]
Nov  5 04:03:21 kernel: EFLAGS: 00010246
Nov  5 04:03:21 kernel: eax: 00000000   ebx: c1544778   ecx: c1a1bcb0 edx: 00000000
Nov  5 04:03:21 kernel: esi: 00000000   edi: 00000601   ebp: c0347ed0 esp: c0347e98
Nov  5 04:03:21 kernel: ds: 0018   es: 0018   ss: 0018
Nov  5 04:03:21 kernel: Process slocate (pid: 1959, process nr: 27, stackpage=c0347000)
Nov  5 04:03:21 kernel: Stack: c021b2e4 00000601 c013224e c0347ed0 fffffa00 00000601 00000000 c0258460
Nov  5 04:03:21 kernel:        c021b2e4 c0258460 c0391380 c1229540 c122958c 00000002 c04ee888 c28c9108
Nov  5 04:03:21 kernel:        c01322aa 00000601 00000000 c0258460 c021b2e4 c0258460 c0258458 c01325f1
Nov  5 04:03:21 kernel: Call Trace: [try_to_free_inodes+210/264] [grow_inodes+30/384] [get_new_inode+173/280] [get_new_inode+185/280] [iget+88/96] 
[ext2_lookup+84/124] [real_lookup+79/160]
Nov  5 04:03:21 kernel:        [lookup_dentry+296/488] [__namei+40/88] [sys_newlstat+14/96] [system_call+52/56]
Nov  5 04:03:21 kernel: Code: 89 50 04 89 02 8d 53 f8 8b 43 f8 8b 4b fc 89 48 04 89 01 89

Comment 8 Bill Nottingham 2000-11-07 17:45:40 UTC

OK, looks like a kernel bug. Reassigning there. To possibly help debug this:

- what filesystems (types) do you have mounted
- do they all check OK if you force a check?

Comment 9 Trevor Peirce 2000-11-07 19:42:41 UTC

All I have is one ext2 filesystem:

Filesystem            Size  Used Avail Use% Mounted on
/dev/hda1             1.8G  1.0G  683M  61% /

And yes, running fsck agrees that everything is fine.

Comment 10 Dana Hudes 2000-11-26 03:21:03 UTC

I have encountered similar problems. I have a number of very larger (gigabyte-
level) tar files consisting of individually gzip'd files that I am converting 
to bzip2 (saves megabytes). I created a Perl script to extract each file, then 
invoke gzip -d then bzip2 -z .
Previously have encountered situation where when slocate runs while gzip or 
bzip is running on these huge files, the gzip or bzip process is hung and the 
filesystem (ext2, a separate 2Gb partition for this exercise) has been 
corrupted. The only way to stop the gzip/bzip is to reboot -- until then you 
can't access the filesystem in question (/dev/sda12).

Now have hit the wall earlier by starting early. 
extracted flows.20000626_19:56:10-0400.gz 17162135 bytes from flow626PM.tar in 8
8.035609 seconds
decompressed flows.20000626_19:56:10-0400.gz 
result is 0x01ran withsignal 1     
Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: Unable to handle kernel paging request at virtual address 0100f2
11

Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: current->tss.cr3 = 07775000, %cr3 = 07775000

Message from syslogd@harmony at Sat Nov 25 21:34:20 2000 ...
harmony kernel: *pde = 00000000


Despite gzip seemingly completing ok, the .gz file is still around.
Here is the ls -l for the last file processed:
rw-------    1 dhudes   dhudes   25985024 Nov 25 21:34 flows.20000626_19:56:10-
0400
-rw-------    1 dhudes   dhudes    3682304 Nov 25 21:35 flows.20000626_19:56:10-
0400.bz2
-rw-r--r--    1 dhudes   dhudes   17162135 Jun 26 19:56 flows.20000626_19:56:10-
0400.gz

I have stopped NFS and nfs locking services. I do have 2 setiathome processes 
running. this machine is 128Mb dual PII-333 .
Samba is running, I'm going to try one more time without it.
while Apache is running (and named), see this vmstat report:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0    164   4684  82796  17604   0   1    78     2   73    27   0   1  99

(with no setiathome running).

Kernel should not take out the system if it runs out of paging. It should 
probably terminate the offending process (and log the event of course).

Comment 11 stephen 2000-11-30 19:14:03 UTC

I am also having a similar problem.

I get kernel Aiee messages like this every couple days! 
---
svc: unknown version (3)
svc: unknown program 100227 (me 100003)
svc: unknown version (3)
scsi0 channel 0 : resetting for second half of retries.
SCSI bus is being reset for host 0 channel 0.
Unable to handle kernel NULL pointer dereference at virtual address 00000114
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<e00015b9>]
EFLAGS: 00010002
eax: df42a078   ebx: df42a078   ecx: 00000001   edx: 00000092
esi: 00000000   edi: 00000046   ebp: 00000006   esp: c02514c0
ds: 0018   es: 0018   ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=c0251000)
Stack: df42a000 df3f5a00 00000000 00000001 dfdd1620 c01b8ad4 df3f5a00 00000001 
       00000001 df3f5a00 00000002 00000027 00000001 c01b86ad df3f5a00 00000001 
       c020a720 00000000 00000000 df42a078 df3f5a00 c01b824c 00000000 df42a000 
Call Trace: [<c01b8ad4>] [<c01b86ad>] [<c020a720>] [<c01b824c>] [<e00022e7>]
[<e000007b>] [<e000248c>] 
       [<c01b8010>] [<c01b67c0>] [<c01b4faf>] [<c01b5073>] [<c01b824c>]
[<c01b871c>] [<c01b824c>] [<e00022e7>] 
       [<c01b8010>] [<c01b67c0>] [<c01b2af5>] [<c01b2bca>] [<c01b824c>]
[<c01bead8>] [<c020b8ed>] [<c01c0578>] 
       [<c01bead8>] [<c01e49dc>] [<c01bf47f>] [<c01e0008>] [<c01bf448>]
[<c01bf47f>] [<c0173fdd>] [<c01727df>] 
       [<c01a3f32>] [<c01be7a5>] [<c01beec5>] [<c01bf056>] [<c01b87dc>]
[<e0005a07>] [<e0005d03>] [<c010c0b6>] 
       [<c0110f72>] [<c010c227>] [<c010b22c>] [<c0108a4d>] [<c0106000>]
[<c0106000>] [<c01001ae>] 
Code: c7 86 14 01 00 00 00 00 08 00 56 8b 86 e8 00 00 00 ff d0 83 
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
In swapper task - not syncing
----

My kernel is Linux 2.2.16-3 #6 SMP Tue Nov 14 12:35:47 PST 2000 i686 unknown.

df
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda5               202220    113001     78779  59% /
/dev/sdb1            286735372  14093112 272642260   5% /export
/dev/sda1               907072    832576     74496  92% /dosc
/dev/sda8              1517920    789896    650916  55% /usr

/sbin/lsmod
Module                  Size  Used by
nfs                    28864   1  (autoclean)
nfsd                  144068   8  (autoclean)
autofs                  9248   6  (autoclean)
lockd                  31656   1  (autoclean) [nfs nfsd]
sunrpc                 54084   1  (autoclean) [nfs nfsd lockd]
eepro100               16564   1  (autoclean)
nls_cp437               3876   2  (autoclean)
vfat                    9404   1  (autoclean)
fat                    31104   1  (autoclean) [vfat]
reiserfs              133160   1  (autoclean)
dpt_i2o                96288   6

free
             total       used       free     shared    buffers     cached
Mem:        515084     294096     220988      19964     170420      94120
-/+ buffers/cache:      29556     485528
Swap:       530064          0     530064

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 795.501
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 mmx fxsr xmm
bogomips        : 1585.97

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 8
model name      : Pentium III (Coppermine)
stepping        : 3
cpu MHz         : 795.501
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
sep_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 mmx fxsr xmm
bogomips        : 1589.25

Comment 12 Perry Harrington 2000-12-07 08:48:56 UTC

This looks a lot like dodgy memory.  slocate definitely excercises the disk
subsystem.  Try to make a list of oopses and the physical addresses of all the
function calls, it's possible that there is a bank of memory that is bad and is
causing this.  Make sure that you are not overclocking as well, low quality
SDRAM will not tolerate higher speeds for long.

Comment 13 Trevor Peirce 2000-12-07 16:29:42 UTC

My system orignally had 48 MB ram (2x16 and 2x8 chips, all EDO), and about 10 days ago I upgraded it to 80 MB (2x16, 2x32, all EDO) and it's been 
up ever since.... so perhaps the RAM chips were the problem... guess Windoze ME doesn't use them enough to care (that's the 48 is in now).

Comment 14 tom 2000-12-08 14:50:22 UTC

Update from Tom (original poster):  I compiled a custom kernel for the machine that was 

giving me trouble, and I no longer get those errors when slocate runs.  My version came 

out about 60k smaller than 2.2.14-5.0 (I compiled out SCSI support and targeted an 

AMD-K6).

Comment 15 Trevor Peirce 2001-08-07 21:04:58 UTC

The new ram seems to have fixed my problem.  I've been up ever since without any crashes.