Bug 16105 - Negative d_count (-1) under very heavy exec load
Summary: Negative d_count (-1) under very heavy exec load
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.0
Hardware: i386
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Michael K. Johnson
QA Contact:
URL: http://boudicca.tux.org/hypermail/lin...
Whiteboard: Winston gold
Keywords:
Depends On:
Blocks:
 
Reported: 2000-08-13 18:11 UTC by hjl
Modified: 2008-05-01 15:37 UTC (History)
0 users

Clone Of:
Last Closed: 2000-08-25 02:34:21 UTC


Attachments

Description Red Hat Bugzilla 2000-08-13 18:11:07 UTC
On two occasions, under very heavy exec load, I got

Negative d_count (-1) for bin/gcc
Unable to handle kernel NULL pointer dereference at virtual address
00000000
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU:    0
EIP:    0010:[<c0138fdd>]
EFLAGS: 00010286
eax: 00000025   ebx: e76a9080   ecx: 00000002   edx: 0000003c
esi: ffffffff   edi: 0805b000   ebp: c1c69a80   esp: f0ed9eec
ds: 0018   es: 0018   ss: 0018

It happened on both beta5 and rc1, on two different SMP machines with 2 GB
and 4 GB of RAM, respectively. I don't know how to trigger it reliably.

Could someone who is familiar with SMP and fs please take a look at

http://boudicca.tux.org/hypermail/linux-kernel/2000week11/0189.html

It sounds very similar to what happened to me. My SMP machines were
under very heavy exec load.

H.J.

Comment 1 Red Hat Bugzilla 2000-08-13 18:59:02 UTC
This defect is considered MUST-FIX for Winston Gold-release

Comment 2 Red Hat Bugzilla 2000-08-14 20:13:27 UTC
Do you have a decoded oops log?

Comment 3 Red Hat Bugzilla 2000-08-16 22:02:01 UTC
Al says that it would take ~300K of patches to the VFS to fix this;
he said that most of his VFS "threading" work was actually fixing
races, and comparatively little of it was truly threading.  We can't
fix this for 7.0, unfortunately.

Comment 4 Red Hat Bugzilla 2000-08-20 18:21:30 UTC
If it won't be fixed in 7.0, can you comment out

	*(int *)0 = 0;

from

out:
        if (count >= 0) { 
                dentry->d_count = count;
                return;
        }

        printk(KERN_CRIT "Negative d_count (%d) for %s/%s\n",
                count,
                dentry->d_parent->d_name.name,
                dentry->d_name.name);
        *(int *)0 = 0;  

in dcache () in fs/dcache.c?


Comment 5 Red Hat Bugzilla 2000-08-20 21:00:51 UTC
Possible worse damage... :-(

Comment 6 Red Hat Bugzilla 2000-08-24 21:45:43 UTC
Something seems wrong since 2.2.16-17, and it is getting worse in 2.2.16-21.
I never saw this problem before on the same SMP machine doing basically the
same work. Now I am seeing it almost every time I do a parallel build of gcc.
Maybe some changes since 2.2.16-17 aggravate the problem.

Comment 7 Red Hat Bugzilla 2000-08-24 21:56:05 UTC
I don't think that we've changed anything that would make this
worse.  Have you re-installed the old kernels you claim were
better to see if the problem gets better?


Comment 8 Red Hat Bugzilla 2000-08-24 22:15:37 UTC
It may take a while to verify, if it is possible at all. Another data point:
all those machines have 4 or more hard drives with many partitions spread
across them. Looking through the changes from 2.2.16-12 to 2.2.16-17, could
linux-2.2.16-sard.patch cause the problem on SMP machines with many hard
drives/partitions?

Comment 9 Red Hat Bugzilla 2000-08-24 22:28:02 UTC
No, sard reports information, but shouldn't have any effect on this.

Comment 10 Red Hat Bugzilla 2000-08-25 02:34:19 UTC
You are right. After backing sard out, I got it again. I will see what
I can find out. It will take me a while to get anywhere.

Comment 11 Red Hat Bugzilla 2000-08-25 20:35:53 UTC
After some investigation, it seems that I was running the wrong kernel.
After rebooting into the right kernel, I have been running my load test
for several hours now. Everything seems OK, so I am closing it for now.

Sorry for that. Thanks.

