On 2 occasions, under very heavy exec load, I got:

Negative d_count (-1) for bin/gcc
Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0138fdd>]
EFLAGS: 00010286
eax: 00000025 ebx: e76a9080 ecx: 00000002 edx: 0000003c
esi: ffffffff edi: 0805b000 ebp: c1c69a80 esp: f0ed9eec
ds: 0018 es: 0018 ss: 0018

It happened on both beta5 and rc1, on two different SMP machines with 2GB and 4GB of RAM, respectively. I don't know how to trigger it reliably. Could someone who is familiar with SMP and fs please take a look at

http://boudicca.tux.org/hypermail/linux-kernel/2000week11/0189.html

It sounds very similar to what happened to me. My SMP machines were under very heavy exec load.

H.J.
This defect is considered MUST-FIX for Winston Gold-release
Do you have a decoded oops log?
Al says that it would take ~300K of patches to the VFS to fix this; he said that most of his VFS "threading" work was actually fixing races, and comparatively little of it was true threading. We can't fix this for 7.0, unfortunately.
If it won't be fixed in 7.0, can you comment out the *(int *)0 = 0; line from this check in dcache () in fs/dcache.c?

out:
	if (count >= 0) {
		dentry->d_count = count;
		return;
	}
	printk(KERN_CRIT "Negative d_count (%d) for %s/%s\n",
	       count,
	       dentry->d_parent->d_name.name,
	       dentry->d_name.name);
	*(int *)0 = 0;
Possible worse damage... :-(
Something seems wrong since 2.2.16-17, and it is getting worse in 2.2.16-21. I never saw this problem before on the same SMP machine doing basically the same stuff. Now I am seeing it almost every time I do a parallel build of gcc. Maybe some change since 2.2.16-17 aggravates the problem.
I don't think that we've changed anything that would make this worse. Have you re-installed the old kernels you claim were better to see if the problem gets better?
It may take a while to verify, if it is possible at all. Another data point: all those machines have 4 or more hard drives with many partitions spread across them. Looking through the changes from 2.2.16-12 to 2.2.16-17, could linux-2.2.16-sard.patch cause the problem on SMP machines with many hard drives/partitions?
No, sard only reports information; it shouldn't have any effect on this.
You are right. After backing sard out, I got it again. I will see what I can find out; it will take me a while to get anywhere.
After some investigation, it seems that I was running the wrong kernel. After rebooting into the right kernel, I have been running my load test for several hours now, and everything seems OK. So I am closing this for now. Sorry for that. Thanks.