Bug 435698

Summary: [5.2][kdump] capture kernel panic at lib/list_debug.c:31
Product: Red Hat Enterprise Linux 5
Reporter: Qian Cai <qcai>
Component: kernel
Assignee: Tomas Henzl <thenzl>
Status: CLOSED CURRENTRELEASE
QA Contact: Chao Ye <cye>
Severity: medium
Docs Contact:
Priority: medium
Version: 5.2
CC: ajb2, bownes, bo.yang, coughlan, cye, duck, jbacik, juanino, phan, revers, stenius, winston.austria
Target Milestone: rc
Keywords: Reopened
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2011-04-18 09:37:22 UTC
Bug Blocks: 591850    
Attachments:
debug patch to figure out where exactly we're corrupting the list. (Flags: none)

Description Qian Cai 2008-03-03 13:25:55 UTC
Description of problem:
The capture kernel panicked at:

...
Scanning logical volumes
  Reading all physical volumes.  This may take a while...
list_add corruption. prev->next should be ffff81000875a550, but was ffff8100086d44e8
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:31
invalid opcode: 0000 [1] SMP 
last sysfs file: /block/ram0/dev
CPU 0 
Modules linked in: dm_snapshot dm_zero dm_mirror dm_mod ext3 jbd usb_storage
ata_piix libata megaraid_sas sd_mod scsi_mod
Pid: 0, comm: swapper Not tainted 2.6.18-83.el5 #1
RIP: 0010:[<ffffffff80146647>]  [<ffffffff80146647>] __list_add+0x48/0x68
RSP: 0018:ffffffff80416eb0  EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81000875a550 RCX: ffffffff8044e560
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff802eb95c
RBP: ffff8100086d44e8 R08: 0000000000000005 R09: 0000000000000038
R10: ffffffff803ca520 R11: 0000000000000000 R12: ffff8100013d9be8
R13: ffff81000875a4f8 R14: 0000000000000198 R15: 0000000000000199
FS:  0000000000000000(0000) GS:ffffffff8039d000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000456000 CR3: 0000000001404000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff803ce000, task ffffffff802e2ae0)
Stack:  ffff81000875a560 ffff8100013d9bc0 0000000000000082 ffffffff8804211a
 0000000000000000 ffff81000875a564 0000000000000246 ffff81000875a628
 0000000000000000 000000000000000a 0000000000000000 0000000000000000
Call Trace:
 <IRQ>  [<ffffffff8804211a>] :megaraid_sas:megasas_complete_cmd_dpc+0x1f9/0x30d
 [<ffffffff800927f4>] tasklet_action+0x62/0xac
 [<ffffffff80011e47>] __do_softirq+0x5e/0xd6
 [<ffffffff800780f8>] end_level_ioapic_vector+0x9/0x16
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006c55e>] do_softirq+0x2c/0x85
 [<ffffffff8006c3e6>] do_IRQ+0xec/0xf5
 [<ffffffff80056bda>] mwait_idle+0x0/0x4a
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80056c10>] mwait_idle+0x36/0x4a
 [<ffffffff80048a93>] cpu_idle+0x95/0xb8
 [<ffffffff803d9801>] start_kernel+0x220/0x225
 [<ffffffff803d922f>] _sinittext+0x22f/0x236


Code: 0f 0b 68 12 ba 29 80 c2 1f 00 4c 89 63 08 49 89 1c 24 4c 89 
RIP  [<ffffffff80146647>] __list_add+0x48/0x68
 RSP <ffffffff80416eb0>
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080225.2
kexec-tools - 1.102pre-10.el5.x86_64
kernel - 2.6.18-83.el5.x86_64

How reproducible:
Seen once on dell-pe2900-02.rhts.boston.redhat.com

Steps to Reproduce:
RHTS ltp-kdump LKDTM modules KPTEB (bug in tasklet_action) test case

Additional info:
Full logs here,
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2100695

RHTS recipe ID,
http://rhts.redhat.com/cgi-bin/rhts/recipes.cgi?id=58817

Comment 1 Qian Cai 2008-03-03 13:30:19 UTC
Another failure was seen here:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2100016

Comment 2 Qian Cai 2008-03-17 13:19:37 UTC
For the record, the above failure was on dell-pe2900-03.rhts.boston.redhat.com,
in case the RHTS link expires.

Comment 3 Josef Bacik 2008-05-06 20:53:42 UTC
That's odd; it doesn't look like this can happen on any of the lists that are
internal to the megaraid driver, as every list action is guarded by a
spinlock.  Is there a core dump by any chance?
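
For readers following the trace: the "list_add corruption" message looks like the output of a sanity check performed before a node is linked in. Below is a minimal user-space sketch of that kind of check, assuming the usual doubly linked circular list layout. The names `list_init`, `checked_list_add` and `checked_list_add_tail`, and the return codes, are hypothetical and chosen for illustration only; the kernel would call BUG() rather than return an error.

```c
#include <assert.h>
#include <stdio.h>

/* Doubly linked circular list node, laid out like the kernel's list_head. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Make a node point at itself, i.e. an empty list. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Hypothetical user-space stand-in for a debug-checked list insert.
 * Inserts new_node between prev and next, but first verifies that prev
 * and next are still adjacent. Returns 0 on success, -1 if the
 * neighbours no longer point at each other (where a kernel would BUG()).
 */
static int checked_list_add(struct list_head *new_node,
                            struct list_head *prev,
                            struct list_head *next)
{
    assert(new_node && prev && next);

    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be %p, but was %p\n",
                (void *)prev, (void *)next->prev);
        return -1;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be %p, but was %p\n",
                (void *)next, (void *)prev->next);
        return -1;
    }

    /* Neighbours are consistent: link the new node in between them. */
    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return 0;
}

/* Add to the tail of the list headed by head (between head->prev and head). */
static int checked_list_add_tail(struct list_head *new_node,
                                 struct list_head *head)
{
    return checked_list_add(new_node, head->prev, head);
}
```

If the kernel check has this shape, the "should be" address in the oops is the `next` argument (for a tail insert, the list head being added to) and the "was" address is whatever `prev->next` actually held when the insert was attempted.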

Comment 4 Qian Cai 2008-05-07 01:33:16 UTC
I don't think there is a vmcore for a capture kernel panic. Do you think it
would be useful to run some SysRq commands after the panic? If so, it might be
possible to send some SysRq keys over the serial console there.

Comment 5 Josef Bacik 2008-05-07 13:22:22 UTC
A core dump would be best so I could figure out which list is being corrupted,
but SysRq may be helpful so I could see what is running on the other processors
and such.

Comment 6 Josef Bacik 2008-05-07 13:56:27 UTC
Ooh jeez, I completely missed that this was a kdump kernel panic, sorry about
that.  SysRq will be fine; I'll take a look at this stuff again.

Comment 7 Josef Bacik 2008-05-07 18:18:43 UTC
How easily is this reproduced?  I think I have a fix, but I want to make sure
it will be obvious that the problem is fixed before we go trying it, as
it's kind of a shot in the dark.

Comment 8 Qian Cai 2008-05-08 02:35:34 UTC
The failure rate was pretty high, and the panic was easy to reproduce, at least
on these two machines:

dell-pe2900-02.rhts.boston.redhat.com
dell-pe2900-03.rhts.boston.redhat.com

Comment 9 Qian Cai 2008-05-08 05:02:44 UTC
The way to reproduce this is:

wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz
tar zxvf ltp-kdump-20080228.tar.gz
cd kdump
export USE_SYMBOL_NAME=1
make
insmod lkdtm.ko cpoint_name=INT_TASKLET_ENTRY cpoint_type=BUG cpoint_count=10

Looks like it could be reproduced by SysRq-C too.

Comment 10 Josef Bacik 2008-05-08 15:39:03 UTC
Created attachment 304872 [details]
debug patch to figure out where exactly we're corrupting the list.

Could you give this patch a whirl and see if it fixes the problem?  If it does
I'd still like to see the logs so I can figure out which of the situations is
happening.

Comment 11 Qian Cai 2008-05-09 04:31:46 UTC
Affected machines are unavailable at the moment.

Comment 12 Josef Bacik 2008-05-19 21:47:47 UTC
OK, let me know when you have a chance to test the patch.

Comment 13 Qian Cai 2008-05-27 08:31:08 UTC
I don't know what has changed on those machines, but the problem could no
longer be reproduced with either the previously buggy (.83) kernel or the latest
RHEL5U2 kernel after a whole day of trying. I'll report here once I see another
occurrence.

Comment 14 Josef Bacik 2008-07-14 13:44:05 UTC
OK, I'm going to close this. If the problem happens again, just reopen.  Thanks.

Comment 15 Qian Cai 2008-08-08 06:12:46 UTC
I am reopening this BZ, as the problem still exists in the RHEL-5.2 GA kernel:

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3851039

It just cannot be reproduced every time.

Comment 16 Josef Bacik 2009-02-24 17:19:32 UTC
This needs to be reassigned to a megaraid person, though I don't know who that is.

Comment 18 Qian Cai 2009-06-22 07:22:20 UTC
The problem is still seen in RHEL5.4, and also on i386.

megasas_aen_polling[2]: event code 0x0027
megasas_aen_polling[2]: event code 0x0000
list_add corruption. prev->next should be c9e5871c, but was c9d4e960
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /block/ram0/dev
Modules linked in: dm_snapshot dm_zero dm_mirror dm_log dm_mod ext3 jbd ata_piix libata megaraid_sas usb_storage sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<c20ee305>]    Not tainted VLI
EFLAGS: 00010046   (2.6.18-154.el5PAE #1) 
EIP is at __list_add+0x39/0x52
eax: 00000048   ebx: c9e5871c   ecx: 00000046   edx: 00000000
esi: c9d4e960   edi: c9c40ba0   ebp: c9c40b80   esp: c233df98
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c233d000 task=c22813c0 task.ti=c22f8000)
Stack: c2245a33 c9e5871c c9d4e960 c9e58724 00000086 00000000 ca86631d 00000286 
       00000285 c9e582e0 00000286 c9e587a0 00000001 00000000 0000000a c20293f7 
       c22f8f74 00000001 c22eeb28 c20292f3 00000000 c22f8f74 c22f8000 00000046 
Call Trace:
 [<ca86631d>] megasas_complete_cmd_dpc+0x255/0x404 [megaraid_sas]
 [<c20293f7>] tasklet_action+0x77/0xf0
 [<c20292f3>] __do_softirq+0x87/0x114
 [<c20073d7>] do_softirq+0x52/0x9c
 [<c204b548>] __do_IRQ+0x0/0xd6
 [<c20074d6>] do_IRQ+0xb5/0xc3
 [<c2005946>] common_interrupt+0x1a/0x20
 [<c2003ce7>] mwait_idle+0x25/0x38
 [<c2003ca8>] cpu_idle+0x9f/0xb9
 [<c22fd9f0>] start_kernel+0x37b/0x383
 =======================
Code: 74 17 50 52 68 f6 59 24 c2 e8 23 6c f3 ff 0f 0b 1a 00 a8 59 24 c2 83 c4 0c 8b 06 39 d8 74 17 50 53 68 33 5a 24 c2 e8 06 6c f3 ff <0f> 0b 1f 00 a8 59 24 c2 83 c4 0c 89 7b 04 89 1f 89 77 04 89 3e 
EIP: [<c20ee305>] __list_add+0x39/0x52 SS:ESP 0068:c233df98
 <0>Kernel panic - not syncing: Fatal exception in interrupt

Comment 21 Tom Coughlan 2009-09-10 18:16:55 UTC
CC'ing LSI. 

Have your guys seen this? Any suggestions for a fix?

Comment 22 Tomas Henzl 2010-01-15 16:26:02 UTC
Moving it to RHEL5.6, no chance for RHEL5.5.

Comment 23 Tomas Henzl 2010-01-15 16:28:37 UTC
This one looks similar to bz#499876. There is a test kernel at https://bugzilla.redhat.com/show_bug.cgi?id=499876#c16
if someone would like to give it a try here as well.

Comment 24 bob bownes 2010-03-03 18:57:23 UTC
I am also seeing this bug in 5.4

root@usageb02 ~ # release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
root@usageb02 ~ # uname -a
Linux usageb02 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
root@usageb02 ~ #


 lockd_down: lockd failed to exit, clearing pid
<157>Feb 28 19:55:06 usageb02 lockd_up: no pid, 2 users??
<157>Feb 28 19:55:06 usageb02 list_add corruption. prev->next should be ffff81042fe117d8, but was 0000000000000000
<157>Feb 28 19:55:06 usageb02 ----------- [cut here ] --------- [please bite here ] ---------
<157>Feb 28 19:55:06 usageb02 Kernel BUG at lib/list_debug.c:31
<157>Feb 28 19:55:06 usageb02 invalid opcode: 0000 [1] SMP
<157>Feb 28 19:55:06 usageb02 last sysfs file: /class/fc_remote_ports/rport-1:0-9/scsi_target_id
<157>Feb 28 19:55:06 usageb02 CPU 3
<157>Feb 28 19:55:06 usageb02 Modules linked in: nfs fscache nfs_acl mptctl mptbase ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) ipv6 xfrm_nalgo crypto_api autofs4 lockd sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i5000_edac shpchp edac_mc bnx2 pcspkr sg hpilo serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
<157>Feb 28 19:55:06 usageb02 Pid: 16907, comm: lockd Tainted: G      2.6.18-164.el5 #1
<157>Feb 28 19:55:06 usageb02 RIP: 0010:[<ffffffff80151298>]  [<ffffffff80151298>] __list_add+0x48/0x68
<157>Feb 28 19:55:06 usageb02 RSP: 0000:ffff81018f45ded0  EFLAGS: 00010082
<157>Feb 28 19:55:06 usageb02 RAX: 0000000000000058 RBX: ffff81042fe117d8 RCX: 0000000000000086
<157>Feb 28 19:55:06 usageb02 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff80308c5c
<157>Feb 28 19:55:06 usageb02 RBP: ffffffff88444900 R08: 00000000000000a0 R09: 000000000000003c
<157>Feb 28 19:55:06 usageb02 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff88444900
<157>Feb 28 19:55:06 usageb02 R13: 00000002f3df20ed R14: 0000000000000000 R15: ffff810214a96e80
<157>Feb 28 19:55:06 usageb02 FS:  00002b1d187ae6e0(0000) GS:ffff81042ff26640(0000) knlGS:0000000000000000
<157>Feb 28 19:55:06 usageb02 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<157>Feb 28 19:55:06 usageb02 CR2: 00007ffff470c000 CR3: 000000030659d000 CR4: 00000000000006e0
<157>Feb 28 19:55:06 usageb02 Process lockd (pid: 16907, threadinfo ffff81018f45c000, task ffff81042f46c860)
<157>Feb 28 19:55:06 usageb02 Stack:  ffffffff88444900 ffff81042fe10000 ffff81042fe10000 ffffffff8001ca64
<157>Feb 28 19:55:06 usageb02  ffff81018f45df20 0000000000000286 00000002f3df20ed ffff81041fb8e000
<157>Feb 28 19:55:06 usageb02  ffff81003b744140 0000000000000003 ffffffff8842e1cd ffffffff8842e2de
<157>Feb 28 19:55:06 usageb02 Call Trace:
<157>Feb 28 19:55:06 usageb02  [<ffffffff8001ca64>] __mod_timer+0xa3/0xbe
<157>Feb 28 19:55:06 usageb02  [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02  [<ffffffff8842e2de>] :lockd:lockd+0x111/0x2bf
<157>Feb 28 19:55:06 usageb02  [<ffffffff8005dfb1>] child_rip+0xa/0x11
<157>Feb 28 19:55:06 usageb02  [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02  [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02  [<ffffffff8005dfa7>] child_rip+0x0/0x11
<157>Feb 28 19:55:06 usageb02
<157>Feb 28 19:55:06 usageb02
<157>Feb 28 19:55:06 usageb02 Code: 0f 0b 68 e9 41 2b 80 c2 1f 00 4c 89 63 08 49 89 1c 24 4c 89
<157>Feb 28 19:55:06 usageb02 RIP  [<ffffffff80151298>] __list_add+0x48/0x68
<157>Feb 28 19:55:06 usageb02  RSP <ffff81018f45ded0>

Comment 29 Tomas Henzl 2010-09-16 13:01:19 UTC
The update to version 4.31 is included in kernel -210; chances are that this fixes the issue. Could you retest with the latest kernel?
Thanks, Tomas

Comment 30 Han Pingtian 2010-09-28 09:05:39 UTC
I cannot reproduce this bug with the -83 kernel. I have tried about 39 times.

Comment 31 Tomas Henzl 2010-09-29 10:34:17 UTC
Thanks for testing.

We can't reproduce this, and in any case we expect that bz#564249 has solved it.

I'm closing this one as a duplicate of bz#564249.

*** This bug has been marked as a duplicate of bug 564249 ***

Comment 32 Han Pingtian 2010-12-22 07:48:12 UTC
We have reproduced it with 5.6 RC1, kernel version is 2.6.18-237.el5:

https://beaker.engineering.redhat.com/recipes/84092

... ...
SysRq : Trigger a crashdump 
Memory for crash kernel (0x0 to 0x0) notwithin permissible range 
?Mounting proc filesystem 
Mounting sysfs filesystem 
Creating /dev 
Creating initial device nodes 
Loading ehci-hcd.ko module 
Loading ohci-hcd.ko module 
Loading uhci-hcd.ko module 
Loading scsi_mod.ko module 
Loading sd_mod.ko module 
Loading megaraid_sas.ko module 
Loading libata.ko module 
Loading ata_piix.ko module 
Loading usb-storage.ko module 
Waiting 8 seconds for driver initialization. 
list_add corruption. prev->next should be ffff81000146aa60, but was ffff810008c0d2e8 
----------- [cut here ] --------- [please bite here ] --------- 
Kernel BUG at lib/list_debug.c:31 
invalid opcode: 0000 [1] SMP  
last sysfs file:  
CPU 0  
Modules linked in: usb_storage ata_piix libata megaraid_sas sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd 
Pid: 0, comm: swapper Not tainted 2.6.18-237.el5 #1 
RIP: 0010:[<ffffffff80157c0e>]  [<ffffffff80157c0e>] __list_add+0x48/0x68 
RSP: 0000:ffffffff804a5eb0  EFLAGS: 00010086 
RAX: 0000000000000058 RBX: ffff81000146aa60 RCX: 0000000000000082 
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff80319f5c 
RBP: ffff810008c0d2e8 R08: 0000000000000005 R09: 0000000000000038 
R10: ffffffff804525a0 R11: 0000000000000000 R12: ffff810008d06be8 
R13: ffff81000146a4f8 R14: 0000000000000194 R15: 0000000000000195 
FS:  0000000000000000(0000) GS:ffffffff80425000(0000) knlGS:0000000000000000 
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
CR2: 00007fffeaacbfd8 CR3: 0000000008caa000 CR4: 00000000000006e0 
Process swapper (pid: 0, threadinfo ffffffff80456000, task ffffffff80310b60) 
Stack:  ffff81000146aa70 ffff810008d06bc0 0000000000000046 ffffffff88073fa6 
 0000000000000000 ffff81000146aa78 0000000000000282 ffff81000146ab40 
 0000000000000001 0000000000000000 000000000000000a 0000000000000000 
Call Trace: 
 <IRQ>  [<ffffffff88073fa6>] :megaraid_sas:megasas_complete_cmd_dpc+0x30d/0x4da 
 [<ffffffff800967cc>] tasklet_action+0x89/0x125 
 [<ffffffff80012464>] __do_softirq+0x89/0x133 
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28 
 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d 
 [<ffffffff8006d485>] do_IRQ+0xec/0xf5 
 [<ffffffff80057020>] mwait_idle+0x0/0x20 
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa 
 <EOI>  [<ffffffff8006b981>] mwait_idle_with_hints+0x66/0x67 
 [<ffffffff8005702c>] mwait_idle+0xc/0x20 
 [<ffffffff800492fd>] cpu_idle+0x95/0xb8 
 [<ffffffff80461807>] start_kernel+0x220/0x225 
 [<ffffffff8046122f>] _sinittext+0x22f/0x236 
 
 
Code: 0f 0b 68 cd 3b 2c 80 c2 1f 00 4c 89 63 08 49 89 1c 24 4c 89  
RIP  [<ffffffff80157c0e>] __list_add+0x48/0x68 
 RSP <ffffffff804a5eb0> 
 <0>Kernel panic - not syncing: Fatal exception

Comment 33 Han Pingtian 2010-12-22 07:49:11 UTC
We were testing dumping to raw partition when triggering this problem.

Comment 34 Tomas Henzl 2011-01-03 10:54:10 UTC
(In reply to comment #33)
> We were testing dumping to raw partition when triggering this problem.

Is this important - I mean can you reproduce it on raw partition, is it reproducible?

Comment 36 Chao Ye 2011-01-25 06:10:42 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > We were testing dumping to raw partition when triggering this problem.
> 
> Is this important - I mean can you reproduce it on raw partition, is it
> reproducible?

I can't reproduce this issue on dell-pe2900-02.rhts.eng.bos.redhat.com via dump to raw target. I tried more than 20 times, all dump successfully. No kernel panic found:
https://beaker.engineering.redhat.com/recipes/98483

Comment 37 Chao Ye 2011-01-25 06:18:38 UTC
(In reply to comment #36)
> I can't reproduce this issue on dell-pe2900-02.rhts.eng.bos.redhat.com via dump
> to raw target. I tried more than 20 times, all dump successfully. No kernel
> panic found:
> https://beaker.engineering.redhat.com/recipes/98483

The test was executed on RHEL5-Server-U6, with kernel-2.6.18-238.el5.i686 and kexec-tools-1.102pre-126.el5.i386.

Comment 38 Qian Cai 2011-04-18 09:37:22 UTC
Looks like this is not reproducible anymore.