Bug 190063

Summary: kernel BUG at include/linux/list.h:167!
Product: [Fedora] Fedora
Reporter: Sandro Casali <sandro.casali>
Component: kernel
Assignee: Dave Jones <davej>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Brian Brock <bbrock>
Severity: medium
Priority: medium
Version: 5
CC: konradr, pfrields, wtogami
Hardware: i686
OS: Linux
Doc Type: Bug Fix
Last Closed: 2006-11-24 23:07:13 UTC
Attachments:
  Output on console from the startup to the crash (flags: none)
  .config of my recompiled kernel (flags: none)

Description Sandro Casali 2006-04-27 08:41:47 UTC
Description of problem:

I installed Fedora Core 5 (FC5) on an IBM server and updated it with yum.
The system starts, and after running for a long time it hits the error
"kernel BUG at include/linux/list.h:167!" and crashes.
I have attached the console output from startup to the crash.

Comment 1 Sandro Casali 2006-04-27 08:41:48 UTC
Created attachment 128292 [details]
Output on console from the startup to the crash

Comment 2 Konrad Rzeszutek 2006-04-27 14:53:01 UTC
A couple of questions:
 a) Which IBM server is it?
 b) What is the BIOS level? Has it been upgraded?
 c) What is the ServerRAID firmware level? Has it been upgraded?
 d) What is the firmware level of the SCSI hard disks? Has it been upgraded?

Was this a problem with previous versions of FC, or with RHEL?

Thanks!

Comment 3 Dan Carpenter 2006-04-27 16:12:52 UTC
It's weird how your dmesg is all jumbled around when aacraid gets loaded.  I
wouldn't have expected that right there, but maybe it's because of the serial
console.

I've seen this bug with aacraid myself.

list.h line 167 is debug code in list_del().
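
To illustrate what such a debug check does, here is a minimal user-space sketch. It is not the Fedora source: struct list_head is re-declared locally, list_del_checked() is a hypothetical name, and it returns -1 where a debug kernel would hit BUG(). The NULL poisoning is also a simplification chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Hypothetical checked delete. Returns 0 on success and -1 when the
 * neighbours no longer point back at entry, which is what happens when
 * the same entry is deleted twice or two contexts race on the list.
 * A debug kernel would stop with BUG() instead of returning an error.
 */
static int list_del_checked(struct list_head *entry)
{
    assert(entry != NULL);
    if (entry->prev == NULL || entry->next == NULL)
        return -1;                      /* already unlinked (poisoned) */
    if (entry->prev->next != entry || entry->next->prev != entry)
        return -1;                      /* list is inconsistent */
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;                 /* poison so reuse is detectable */
    entry->prev = NULL;
    return 0;
}
```

Without the check, the second delete of the same entry would go unnoticed and could corrupt the list. With it, the problem is reported at the point of the bad delete.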

I looked through the aacraid driver for possible race conditions involving
list_del().  One candidate is the list_del() in aac_intr_normal() in
drivers/scsi/aacraid/dpcsup.c:

http://sosdg.org/~coywolf/lxr/source/drivers/scsi/aacraid/dpcsup.c?v=2.6.15#L287

The other list_del() calls are protected by spin_lock_irqsave(), but that one
isn't.
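
The locking pattern described above can be sketched in user space as follows. This is only an analogy, not the aacraid code: a pthread mutex stands in for spin_lock_irqsave() (a real interrupt handler cannot sleep on a mutex), and struct queue, struct node, the queued flag, and the function names are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *next, *prev;
    bool queued;              /* true while the node is linked into a queue */
};

struct queue {
    pthread_mutex_t lock;     /* user-space stand-in for the driver's spinlock */
    struct node head;         /* sentinel; the queue is empty when head.next == &head */
};

static void queue_init(struct queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head.next = q->head.prev = &q->head;
    q->head.queued = false;
}

/* Link n right after the sentinel, under the lock. */
static void queue_add(struct queue *q, struct node *n)
{
    pthread_mutex_lock(&q->lock);
    n->next = q->head.next;
    n->prev = &q->head;
    q->head.next->prev = n;
    q->head.next = n;
    n->queued = true;
    pthread_mutex_unlock(&q->lock);
}

/*
 * Unlink n if it is still queued. The check and the unlink happen under
 * the same lock, so only one caller can ever perform the delete.
 * Returns true if this caller unlinked n, false if it was already gone.
 */
static bool queue_remove(struct queue *q, struct node *n)
{
    bool removed = false;

    assert(q != NULL && n != NULL);
    pthread_mutex_lock(&q->lock);
    if (n->queued) {
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->next = n->prev = NULL;
        n->queued = false;
        removed = true;
    }
    pthread_mutex_unlock(&q->lock);
    return removed;
}
```

If the unlink were done outside the lock, two contexts could both see the node as queued and both unlink it, which is the kind of double delete the list_del() debug check reports.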



Comment 4 Sandro Casali 2006-04-28 09:57:18 UTC
(In reply to comment #3)

What do you suggest?
Thank you, and sorry for my English.

Comment 5 Sandro Casali 2006-04-28 10:15:36 UTC
(In reply to comment #2)
> A couple of questions:
>  a) Which IBM server is it?
>  b) What is the BIOS level. Has it been upgraded?
>  c) What is the ServerRAID firmware level? Has it been upgraded?
>  d) What is the SCSI hard-disks firmware level? Has it been upgraded?
> 
> Was this a problem with previous versions of FC? Or RHEL?
> 
> Thanks!

My responses:
  a) IBM eserver xSeries 260 type 8865
  b) BIOS version 1.0 date 08/11/05 build ZUE140AUS (not upgraded)
  c) Adaptec SAS RAID BIOS V5.0-2 build 8264 (upgraded)
  d) How can I verify this?

I don't know whether this problem occurred with previous versions of FC or RHEL.

Thank you, and sorry for my English.

Comment 6 Konrad Rzeszutek 2006-04-28 15:21:45 UTC
(In reply to comment #5)
> (In reply to comment #2)

> 
> My responses:
>   a) IBM eserver xSeries 260 type 8865
>   b) BIOS version 1.0 date 08/11/05 build ZUE140AUS (not upgraded)

You might want to update it.

>   c) Adaptec SAS RAID BIOS V5.0-2 build 8264 (upgraded)
>   d) How can I verify this?

During the POST, you will see the Adaptec RAID controller enumerating the SAS
devices. The rightmost column should show a four-character string, such as
S512 or S516.

Make sure that _ALL_ of them are at the right revision; if they are not,
download the ServerRAID Xpress Update CD.

Comment 7 Sandro Casali 2006-05-04 10:08:26 UTC
(In reply to comment #6)

I have upgraded everything, but the result is the same.

Comment 8 Sandro Casali 2006-05-04 10:36:44 UTC
Created attachment 128595 [details]
.config of my recompiled kernel

Comment 9 Sandro Casali 2006-05-04 10:39:21 UTC
(In reply to comment #3)

I tried running the machine with the normal (non-SMP) kernel. With that kernel
the problem disappeared, but it uses only 1 CPU and 4 GB of RAM.

So I recompiled the kernel using the .config of the normal kernel
(configs/kernel-2.6.16-i686.config), with SMP support enabled for 8 CPUs and
High Memory Support set to 64GB.
With this recompiled kernel the machine has been running without apparent
problems for more than 2 days.


Comment 10 Konrad Rzeszutek 2006-05-04 19:46:51 UTC
What is the physical amount of memory?

Also, try using the -largesmp kernel; it should have support for very large
configurations.

Comment 11 Sandro Casali 2006-05-04 20:41:31 UTC
(In reply to comment #10)
> What is the physical amount of memory?
> 
> Also, try using the -largesmp kernel - that should have the support for huge
> configuration.

The physical amount of memory is 8GB.

What is the -largesmp kernel?


Comment 12 Dan Carpenter 2006-05-04 23:29:20 UTC
The reason why the kernel.org kernel works is because it doesn't have the debug
check in list_del().  That's only in fedora and the -mm kernel.

I'm surprised that this race condition was never caught before.  It seems like
it would lead to corruption pretty quickly...

Comment 13 Sandro Casali 2006-05-05 08:22:20 UTC
(In reply to comment #12)
> The reason why the kernel.org kernel works is because it doesn't have the debug
> check in list_del().  That's only in fedora and the -mm kernel.
> 
> I'm surprised that this race condition was never caught before.  It seems like
> it would lead to corruption pretty quickly...

I did not use a kernel.org (vanilla) kernel. I used the Fedora kernel source,
built from
http://download.fedora.redhat.com/pub/fedora/linux/core/updates/5/SRPMS/kernel-2.6.16-1.2096_FC5.src.rpm
as described at http://fedora.redhat.com/docs/release-notes/fc5/#id2918351


Comment 14 Konrad Rzeszutek 2006-06-16 20:17:36 UTC
Has the most recent kernel fixed your problem?

Comment 15 Sandro Casali 2006-06-20 19:39:28 UTC
(In reply to comment #14)
> Has the most recent kernel fixed your problem?

No.

After some days, the problem also occurred with the kernel described in
comment #9.

I have the same problem with the x86_64 version.

Currently I am running without apparent problems on a kernel built from
vanilla source with a .config (kernel-2.6.16-x86_64.config), both taken from
kernel-2.6.16-1.2122_FC5.src.rpm.


Comment 16 Dan Carpenter 2006-06-20 22:16:31 UTC
I have this problem with 2.6.16-1.2133_FC5 as well.

Konrad, is there any way you could contact Adaptec about this?  Their
out-of-tree aacraid-dkms-1.1.5-2423.tgz has patches to fix this, and it works.
If you grep through their new code for "RMQ" you can see where they've changed
how they call list_del().

I don't understand the code that well, so I don't know what the issues are, but
it seems like we're putting a lot of effort into it if Adaptec already has a
fix.



Comment 17 Dave Jones 2006-10-17 00:41:08 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 18 Dave Jones 2006-11-24 23:07:13 UTC
This bug has been mass-closed along with all other bugs that
have been in NEEDINFO state for several months.

Due to the large volume of inactive bugs in bugzilla, this
is the only method we have of cleaning out stale bug reports
where the reporter has disappeared.

If you can reproduce this bug after installing all the
current updates, please reopen this bug.

If you are not the reporter, you can add a comment requesting
it be reopened, and someone will get to it asap.

Thank you.