Bug 190063
Summary: | kernel BUG at include/linux/list.h:167! | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Sandro Casali <sandro.casali>
Component: | kernel | Assignee: | Dave Jones <davej>
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Brian Brock <bbrock>
Severity: | medium | Priority: | medium
Version: | 5 | CC: | konradr, pfrields, wtogami
Hardware: | i686 | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2006-11-24 23:07:13 UTC

Description
Sandro Casali
2006-04-27 08:41:47 UTC
Created attachment 128292
Output on console from the startup to the crash

Couple of questions:
a) Which IBM server is it?
b) What is the BIOS level? Has it been upgraded?
c) What is the ServerRAID firmware level? Has it been upgraded?
d) What is the firmware level of the SCSI hard disks? Has it been upgraded?

Was this a problem with previous versions of FC? Or RHEL?

Thanks!

It's weird how your dmesg is all jumbled around when aacraid gets loaded. I wouldn't have expected that right there, but maybe it's because of the serial console.

I've seen this bug with aacraid myself.

list.h line 167 is debug code in list_del().

I took a look through aacraid for possible race conditions with list_del(). One possible race condition is the list_del() in aac_intr_normal() in drivers/scsi/aacraid/dpcsup.c

http://sosdg.org/~coywolf/lxr/source/drivers/scsi/aacraid/dpcsup.c?v=2.6.15#L287

The other list_del()s are protected by spin_lock_irqsave() but that one isn't.

(In reply to comment #3)
> It's weird how your dmesg is all jumbled around when aacraid gets loaded. I
> wouldn't have expected that right there, but maybe it's because of the serial
> console.
>
> I've seen this bug with aacraid myself.
>
> list.h line 167 is debug code in list_del().
>
> I took a look through aacraid for possible race conditions with
> list_del(). One possible race condition is the list_del() in
> aac_intr_normal() in drivers/scsi/aacraid/dpcsup.c
>
> http://sosdg.org/~coywolf/lxr/source/drivers/scsi/aacraid/dpcsup.c?v=2.6.15#L287
>
> The other list_del()s are protected by spin_lock_irqsave() but that one isn't.

What do you suggest? Thank you, and sorry for my English.

(In reply to comment #2)
> Couple of questions:
> a) Which IBM server is it?
> b) What is the BIOS level? Has it been upgraded?
> c) What is the ServerRAID firmware level? Has it been upgraded?
> d) What is the firmware level of the SCSI hard disks? Has it been upgraded?
>
> Was this a problem with previous versions of FC? Or RHEL?
>
> Thanks!

My responses:
a) IBM eserver xSeries 260 type 8865
b) BIOS version 1.0, date 08/11/05, build ZUE140AUS (not upgraded)
c) Adaptec SAS RAID BIOS V5.0-2 build 8264 (upgraded)
d) How can I verify this?

I don't know whether this problem occurred with previous versions of FC or RHEL.

Thank you, and sorry for my English.

(In reply to comment #5)
> (In reply to comment #2)
>
> My responses:
> a) IBM eserver xSeries 260 type 8865
> b) BIOS version 1.0, date 08/11/05, build ZUE140AUS (not upgraded)

You might want to update it.

> c) Adaptec SAS RAID BIOS V5.0-2 build 8264 (upgraded)
> d) How can I verify this?

During the POST, you will see the Adaptec RAID controller enumerating the SAS devices. The last column on the right should show a four-character string, such as S512 or S516.

Make sure that _ALL_ of them are the right revision and if they are not, download the ServerRAID Xpress Update CD.

(In reply to comment #6)
> (In reply to comment #5)
> > (In reply to comment #2)
> >
> > My responses:
> > a) IBM eserver xSeries 260 type 8865
> > b) BIOS version 1.0, date 08/11/05, build ZUE140AUS (not upgraded)
>
> You might want to update it.
>
> > c) Adaptec SAS RAID BIOS V5.0-2 build 8264 (upgraded)
> > d) How can I verify this?
>
> During the POST, you will see the Adaptec RAID controller enumerating the SAS
> devices. The last column on the right should show a four-character string,
> such as S512 or S516.
>
> Make sure that _ALL_ of them are the right revision and if they are not,
> download the ServerRAID Xpress Update CD.

I have upgraded everything, but the result is the same.

Created attachment 128595
.config of my recompiled kernel

(In reply to comment #3)
> It's weird how your dmesg is all jumbled around when aacraid gets loaded. I
> wouldn't have expected that right there, but maybe it's because of the serial
> console.
>
> I've seen this bug with aacraid myself.
>
> list.h line 167 is debug code in list_del().
>
> I took a look through aacraid for possible race conditions with
> list_del(). One possible race condition is the list_del() in
> aac_intr_normal() in drivers/scsi/aacraid/dpcsup.c
>
> http://sosdg.org/~coywolf/lxr/source/drivers/scsi/aacraid/dpcsup.c?v=2.6.15#L287
>
> The other list_del()s are protected by spin_lock_irqsave() but that one isn't.

I tried running the machine with the normal (non-SMP) kernel. With that kernel the problem disappeared, but the kernel uses only 1 CPU and 4 GB of RAM.
So I recompiled the kernel using the .config of the normal kernel (configs/kernel-2.6.16-i686.config), updated with SMP support enabled for 8 CPUs and High Memory Support set to 64GB.
With this recompiled kernel the machine has been running without apparent problems for more than 2 days.

What is the physical amount of memory?

Also, try using the -largesmp kernel - that should have the support for huge configuration.

(In reply to comment #10)
> What is the physical amount of memory?
>
> Also, try using the -largesmp kernel - that should have the support for huge
> configuration.

The physical amount of memory is 8GB.
What is the -largesmp kernel?

The reason why the kernel.org kernel works is because it doesn't have the debug check in list_del(). That's only in Fedora and the -mm kernel.

I'm surprised that this race condition was never caught before. It seems like it would lead to corruption pretty quickly...

(In reply to comment #12)
> The reason why the kernel.org kernel works is because it doesn't have the debug
> check in list_del(). That's only in Fedora and the -mm kernel.
>
> I'm surprised that this race condition was never caught before. It seems like
> it would lead to corruption pretty quickly...

I did not use a kernel.org (vanilla) kernel, but the Fedora source kernel built from
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/5/SRPMS/kernel-2.6.16-1.2096_FC5.src.rpm
as described on
http://fedora.redhat.com/docs/release-notes/fc5/#id2918351

Has the most recent kernel fixed your problem?

(In reply to comment #14)
> Has the most recent kernel fixed your problem?

No. After some days, the problem also occurred with the kernel described in comment #9.

I have the same problem with the x86_64 version.
Currently I am running without apparent problems with a kernel built from vanilla source with a .config (kernel-2.6.16-x86_64.config), both taken from kernel-2.6.16-1.2122_FC5.src.rpm.

I have this problem with 2.6.16-1.2133_FC5 as well.

Konrad, is there any way you could contact Adaptec about this? Their out-of-tree aacraid-dkms-1.1.5-2423.tgz has patches to fix this and it works. If you grep through their new code for "RMQ" you can see where they've changed how they call list_del(). I don't understand the code that well so I don't know what the issues are, but it seems like we're putting a lot of effort into it if Adaptec already has a fix.

A new kernel update has been released (Version: 2.6.18-1.2200.fc5) based upon a new upstream kernel release. Please retest against this new kernel, as a large number of patches go into each upstream release, possibly including changes that may address this problem.

This bug has been placed in NEEDINFO state. Due to the large volume of inactive bugs in bugzilla, if this bug is still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter can reopen the bug at any time. Any other users on the Cc: list of this bug can request that the bug be reopened by adding a comment to the bug.

In the last few updates, some users upgrading from FC4->FC5 have reported that installing a kernel update has left their systems unbootable. If you have been affected by this problem please check you only have one version of device-mapper & lvm2 installed. See bug 207474 for further details.

If this bug is a problem preventing you from installing the release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different problem, please file a separate bug for the new problem.

Thank you.

This bug has been mass-closed along with all other bugs that have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this is the only method we have of cleaning out stale bug reports where the reporter has disappeared.

If you can reproduce this bug after installing all the current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting it be reopened, and someone will get to it asap.

Thank you.
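
For readers following the list_del() discussion in comment #3, here is a minimal userspace sketch of the idea, not the kernel or aacraid code. It assumes the debug check verifies that an entry's neighbours still point back at it before unlinking. The names `list_del_checked` and `list_del_locked` are hypothetical, a pthread mutex stands in for spin_lock_irqsave(), and NULL stands in for the kernel's poison pointers.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Circular doubly-linked list node, modelled loosely on the kernel's. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Hypothetical stand-in for a debug-checked list_del(): refuse to unlink
 * an entry whose neighbours no longer point back at it. The kernel check
 * discussed in this bug halts with BUG(); this sketch returns false.
 */
static bool list_del_checked(struct list_head *e)
{
    if (e->prev == NULL || e->next == NULL)
        return false;               /* already deleted */
    if (e->prev->next != e || e->next->prev != e)
        return false;               /* list is inconsistent */
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = NULL;                 /* assumption: NULL instead of poison */
    e->prev = NULL;
    return true;
}

/* Assumption: a mutex plays the role of the driver's spinlock. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every deleter takes the same lock, so two callers cannot interleave. */
static bool list_del_locked(struct list_head *e)
{
    pthread_mutex_lock(&list_lock);
    bool ok = list_del_checked(e);
    pthread_mutex_unlock(&list_lock);
    return ok;
}
```

Deleting the same entry twice, which is what two racing unlocked callers could end up doing, makes the second call return false here. A kernel without such a check would not stop at that point, which fits the observation in comment #12 that a kernel lacking the check appears to work.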