Bug 435698
Summary: | [5.2][kdump] capture kernel panic at lib/list_debug.c:31
---|---
Product: | Red Hat Enterprise Linux 5
Reporter: | Qian Cai <qcai>
Component: | kernel
Assignee: | Tomas Henzl <thenzl>
Status: | CLOSED CURRENTRELEASE
QA Contact: | Chao Ye <cye>
Severity: | medium
Priority: | medium
Version: | 5.2
CC: | ajb2, bownes, bo.yang, coughlan, cye, duck, jbacik, juanino, phan, revers, stenius, winston.austria
Target Milestone: | rc
Keywords: | Reopened
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2011-04-18 09:37:22 UTC
Bug Blocks: | 591850

Description
Qian Cai
2008-03-03 13:25:55 UTC
Seen another failure here:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2100016

Noting here that the above failure was on dell-pe2900-03.rhts.boston.redhat.com, in case the rhts link expires.

That's odd, it doesn't look like this can happen on any of the lists that are internal to the megaraid driver, as every list action is guarded by a spinlock. Is there a coredump by chance?

I don't think there is a vmcore for a capture kernel panic. Do you think it would be useful to run some SysRq commands after the panic? If so, it might be possible to send some SysRq keys over the serial console there.

A core dump would be best so I could figure out which list is being corrupted, but SysRq may be helpful so I could see what is running on the other processors and such.

Ooh jeez, I completely missed that this was a kdump kernel panic, sorry about that. SysRq will be fine, I'll take a look at this stuff again.

How easily is this reproduced? I think I have a fix, but I want to make sure it will be obvious that the problem is fixed before we go trying it, as it's kind of a shot in the dark.

The failure rate was pretty high, and the panic was easy to reproduce at least on these two machines:

dell-pe2900-02.rhts.boston.redhat.com
dell-pe2900-03.rhts.boston.redhat.com

The way to reproduce this is:

wget http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz
tar zxvf ltp-kdump-20080228.tar.gz
cd kdump
export USE_SYMBOL_NAME=1
make
insmod lkdtm.ko cpoint_name=INT_TASKLET_ENTRY cpoint_type=BUG cpoint_count=10

Looks like it could be reproduced by SysRq-C too.

Created attachment 304872
debug patch to figure out where exactly we're corrupting the list.

Could you give this patch a whirl and see if it fixes the problem? If it does, I'd still like to see the logs so I can figure out which of the situations is happening.
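For anyone following along who has not looked at the list debugging code: the BUG at lib/list_debug.c:31 comes from a sanity check that runs before a node is linked in, which verifies that the two neighbours still point at each other. Below is a rough userspace sketch of that idea in C. It is an illustration only, not the RHEL kernel source: `checked_list_add`, `checked_list_add_tail`, `init_list_head` and the 0/-1 return codes are made-up names for this example, and the real kernel calls BUG() instead of returning an error.

```c
#include <assert.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled on the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Hypothetical helper (name and return codes are assumptions for this sketch):
 * insert new_node between prev and next, but first verify that prev and next
 * still point at each other. Returns 0 on success and -1 if the neighbours
 * disagree, which is the situation where the kernel prints
 * "list_add corruption" and hits BUG().
 */
static int checked_list_add(struct list_head *new_node,
                            struct list_head *prev,
                            struct list_head *next)
{
    if (next->prev != prev) {
        fprintf(stderr,
                "list_add corruption. next->prev should be %p, but was %p\n",
                (void *)prev, (void *)next->prev);
        return -1;
    }
    if (prev->next != next) {
        fprintf(stderr,
                "list_add corruption. prev->next should be %p, but was %p\n",
                (void *)next, (void *)prev->next);
        return -1;
    }
    /* Neighbours are consistent, so link the new node in. */
    next->prev = new_node;
    new_node->next = next;
    new_node->prev = prev;
    prev->next = new_node;
    return 0;
}

/* Append at the tail of the list, i.e. between head->prev and head. */
static int checked_list_add_tail(struct list_head *new_node,
                                 struct list_head *head)
{
    return checked_list_add(new_node, head->prev, head);
}
```

Read this way, a message of the form "prev->next should be X, but was Y" says that X is the node the new entry was about to be inserted in front of, and Y is what the previous node's next pointer actually held when the check ran.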
Affected machines are unavailable at the moment.

OK, let me know when you have a chance to test the patch.

I don't know what has been changed on those machines, but the problem could no longer be reproduced with either the previously buggy kernel (.83) or the latest RHEL5U2 kernel after a whole day of trying. I'll report here once I see another occurrence.

OK, I'm going to close this. If the problem happens again, just reopen. Thanks.

I am reopening this BZ, as the problem still exists in the RHEL-5.2 GA kernel:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=3851039

It just could not be reproduced every time.

This needs to be reassigned to a megaraid person, though I don't know who that is.

The problem is still seen in RHEL5.4, and also with i386.

megasas_aen_polling[2]: event code 0x0027
megasas_aen_polling[2]: event code 0x0000
list_add corruption. prev->next should be c9e5871c, but was c9d4e960
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] SMP
last sysfs file: /block/ram0/dev
Modules linked in: dm_snapshot dm_zero dm_mirror dm_log dm_mod ext3 jbd ata_piix libata megaraid_sas usb_storage sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
CPU: 0
EIP: 0060:[<c20ee305>] Not tainted VLI
EFLAGS: 00010046 (2.6.18-154.el5PAE #1)
EIP is at __list_add+0x39/0x52
eax: 00000048 ebx: c9e5871c ecx: 00000046 edx: 00000000
esi: c9d4e960 edi: c9c40ba0 ebp: c9c40b80 esp: c233df98
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, ti=c233d000 task=c22813c0 task.ti=c22f8000)
Stack: c2245a33 c9e5871c c9d4e960 c9e58724 00000086 00000000 ca86631d 00000286
       00000285 c9e582e0 00000286 c9e587a0 00000001 00000000 0000000a c20293f7
       c22f8f74 00000001 c22eeb28 c20292f3 00000000 c22f8f74 c22f8000 00000046
Call Trace:
 [<ca86631d>] megasas_complete_cmd_dpc+0x255/0x404 [megaraid_sas]
 [<c20293f7>] tasklet_action+0x77/0xf0
 [<c20292f3>] __do_softirq+0x87/0x114
 [<c20073d7>] do_softirq+0x52/0x9c
 [<c204b548>] __do_IRQ+0x0/0xd6
 [<c20074d6>] do_IRQ+0xb5/0xc3
 [<c2005946>] common_interrupt+0x1a/0x20
 [<c2003ce7>] mwait_idle+0x25/0x38
 [<c2003ca8>] cpu_idle+0x9f/0xb9
 [<c22fd9f0>] start_kernel+0x37b/0x383
 =======================
Code: 74 17 50 52 68 f6 59 24 c2 e8 23 6c f3 ff 0f 0b 1a 00 a8 59 24 c2 83 c4 0c 8b 06 39 d8 74 17 50 53 68 33 5a 24 c2 e8 06 6c f3 ff <0f> 0b 1f 00 a8 59 24 c2 83 c4 0c 89 7b 04 89 1f 89 77 04 89 3e
EIP: [<c20ee305>] __list_add+0x39/0x52 SS:ESP 0068:c233df98
<0>Kernel panic - not syncing: Fatal exception in interrupt

CC'ing LSI. Have your guys seen this? Any suggestions for a fix?

Moving it to RHEL5.6, no chance for RHEL5.5.

This one looks similar to bz#499876. There is a test kernel at
https://bugzilla.redhat.com/show_bug.cgi?id=499876#c16
if someone would like to give it a chance here also?

I am also seeing this bug in 5.4.

root@usageb02 ~ # release
Red Hat Enterprise Linux Server release 5.4 (Tikanga)
root@usageb02 ~ # uname -a
Linux usageb02 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
root@usageb02 ~ # lockd_down: lockd failed to exit, clearing pid
<157>Feb 28 19:55:06 usageb02 lockd_up: no pid, 2 users??
<157>Feb 28 19:55:06 usageb02 list_add corruption. prev->next should be ffff81042fe117d8, but was 0000000000000000
<157>Feb 28 19:55:06 usageb02 ----------- [cut here ] --------- [please bite here ] ---------
<157>Feb 28 19:55:06 usageb02 Kernel BUG at lib/list_debug.c:31
<157>Feb 28 19:55:06 usageb02 invalid opcode: 0000 [1] SMP
<157>Feb 28 19:55:06 usageb02 last sysfs file: /class/fc_remote_ports/rport-1:0-9/scsi_target_id
<157>Feb 28 19:55:06 usageb02 CPU 3
<157>Feb 28 19:55:06 usageb02 Modules linked in: nfs fscache nfs_acl mptctl mptbase ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) ipv6 xfrm_nalgo crypto_api autofs4 lockd sunrpc dm_round_robin dm_multipath scsi_dh video hwmon backlight sbs i2c_ec i2c_core button battery asus_acpi acpi_memhotplug ac parport_pc lp parport i5000_edac shpchp edac_mc bnx2 pcspkr sg hpilo serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
<157>Feb 28 19:55:06 usageb02 Pid: 16907, comm: lockd Tainted: G 2.6.18-164.el5 #1
<157>Feb 28 19:55:06 usageb02 RIP: 0010:[<ffffffff80151298>] [<ffffffff80151298>] __list_add+0x48/0x68
<157>Feb 28 19:55:06 usageb02 RSP: 0000:ffff81018f45ded0 EFLAGS: 00010082
<157>Feb 28 19:55:06 usageb02 RAX: 0000000000000058 RBX: ffff81042fe117d8 RCX: 0000000000000086
<157>Feb 28 19:55:06 usageb02 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff80308c5c
<157>Feb 28 19:55:06 usageb02 RBP: ffffffff88444900 R08: 00000000000000a0 R09: 000000000000003c
<157>Feb 28 19:55:06 usageb02 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff88444900
<157>Feb 28 19:55:06 usageb02 R13: 00000002f3df20ed R14: 0000000000000000 R15: ffff810214a96e80
<157>Feb 28 19:55:06 usageb02 FS: 00002b1d187ae6e0(0000) GS:ffff81042ff26640(0000) knlGS:0000000000000000
<157>Feb 28 19:55:06 usageb02 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<157>Feb 28 19:55:06 usageb02 CR2: 00007ffff470c000 CR3: 000000030659d000 CR4: 00000000000006e0
<157>Feb 28 19:55:06 usageb02 Process lockd (pid: 16907, threadinfo ffff81018f45c000, task ffff81042f46c860)
<157>Feb 28 19:55:06 usageb02 Stack: ffffffff88444900 ffff81042fe10000 ffff81042fe10000 ffffffff8001ca64
<157>Feb 28 19:55:06 usageb02 ffff81018f45df20 0000000000000286 00000002f3df20ed ffff81041fb8e000
<157>Feb 28 19:55:06 usageb02 ffff81003b744140 0000000000000003 ffffffff8842e1cd ffffffff8842e2de
<157>Feb 28 19:55:06 usageb02 Call Trace:
<157>Feb 28 19:55:06 usageb02 [<ffffffff8001ca64>] __mod_timer+0xa3/0xbe
<157>Feb 28 19:55:06 usageb02 [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02 [<ffffffff8842e2de>] :lockd:lockd+0x111/0x2bf
<157>Feb 28 19:55:06 usageb02 [<ffffffff8005dfb1>] child_rip+0xa/0x11
<157>Feb 28 19:55:06 usageb02 [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02 [<ffffffff8842e1cd>] :lockd:lockd+0x0/0x2bf
<157>Feb 28 19:55:06 usageb02 [<ffffffff8005dfa7>] child_rip+0x0/0x11
<157>Feb 28 19:55:06 usageb02 Code: 0f 0b 68 e9 41 2b 80 c2 1f 00 4c 89 63 08 49 89 1c 24 4c 89
<157>Feb 28 19:55:06 usageb02 RIP [<ffffffff80151298>] __list_add+0x48/0x68
<157>Feb 28 19:55:06 usageb02 RSP <ffff81018f45ded0>

The update to version 4.31 is included in kernel -210, so chances are that this could mend the issue. Could you retest with the latest kernel?
Thanks, Tomas

I cannot reproduce this bug with the -83 kernel. I have tried about 39 times.

Thanks for testing. We can't reproduce this, and in any case we expect that bz#564249 has solved it. I'm closing this one as a dup of bz#564249.

*** This bug has been marked as a duplicate of bug 564249 ***

We have reproduced it with 5.6 RC1, kernel version 2.6.18-237.el5:
https://beaker.engineering.redhat.com/recipes/84092

... ...
SysRq : Trigger a crashdump
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
?Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading ehci-hcd.ko module
Loading ohci-hcd.ko module
Loading uhci-hcd.ko module
Loading scsi_mod.ko module
Loading sd_mod.ko module
Loading megaraid_sas.ko module
Loading libata.ko module
Loading ata_piix.ko module
Loading usb-storage.ko module
Waiting 8 seconds for driver initialization.
list_add corruption. prev->next should be ffff81000146aa60, but was ffff810008c0d2e8
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at lib/list_debug.c:31
invalid opcode: 0000 [1] SMP
last sysfs file:
CPU 0
Modules linked in: usb_storage ata_piix libata megaraid_sas sd_mod scsi_mod uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-237.el5 #1
RIP: 0010:[<ffffffff80157c0e>] [<ffffffff80157c0e>] __list_add+0x48/0x68
RSP: 0000:ffffffff804a5eb0 EFLAGS: 00010086
RAX: 0000000000000058 RBX: ffff81000146aa60 RCX: 0000000000000082
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff80319f5c
RBP: ffff810008c0d2e8 R08: 0000000000000005 R09: 0000000000000038
R10: ffffffff804525a0 R11: 0000000000000000 R12: ffff810008d06be8
R13: ffff81000146a4f8 R14: 0000000000000194 R15: 0000000000000195
FS: 0000000000000000(0000) GS:ffffffff80425000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffeaacbfd8 CR3: 0000000008caa000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffffffff80456000, task ffffffff80310b60)
Stack: ffff81000146aa70 ffff810008d06bc0 0000000000000046 ffffffff88073fa6
       0000000000000000 ffff81000146aa78 0000000000000282 ffff81000146ab40
       0000000000000001 0000000000000000 000000000000000a 0000000000000000
Call Trace:
 <IRQ> [<ffffffff88073fa6>] :megaraid_sas:megasas_complete_cmd_dpc+0x30d/0x4da
 [<ffffffff800967cc>] tasklet_action+0x89/0x125
 [<ffffffff80012464>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
 [<ffffffff8006d485>] do_IRQ+0xec/0xf5
 [<ffffffff80057020>] mwait_idle+0x0/0x20
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff8006b981>] mwait_idle_with_hints+0x66/0x67
 [<ffffffff8005702c>] mwait_idle+0xc/0x20
 [<ffffffff800492fd>] cpu_idle+0x95/0xb8
 [<ffffffff80461807>] start_kernel+0x220/0x225
 [<ffffffff8046122f>] _sinittext+0x22f/0x236
Code: 0f 0b 68 cd 3b 2c 80 c2 1f 00 4c 89 63 08 49 89 1c 24 4c 89
RIP [<ffffffff80157c0e>] __list_add+0x48/0x68
RSP <ffffffff804a5eb0>
<0>Kernel panic - not syncing: Fatal exception

We were testing dumping to raw partition when triggering this problem.

(In reply to comment #33)
> We were testing dumping to raw partition when triggering this problem.

Is this important - I mean can you reproduce it on raw partition, is it reproducible?

(In reply to comment #34)
> (In reply to comment #33)
> > We were testing dumping to raw partition when triggering this problem.
>
> Is this important - I mean can you reproduce it on raw partition, is it
> reproducible?

I can't reproduce this issue on dell-pe2900-02.rhts.eng.bos.redhat.com via dump to raw target. I tried more than 20 times, and all dumps succeeded. No kernel panic found:
https://beaker.engineering.redhat.com/recipes/98483

(In reply to comment #36)
> I can't reproduce this issue on dell-pe2900-02.rhts.eng.bos.redhat.com via dump
> to raw target. I tried more than 20 times, and all dumps succeeded. No kernel
> panic found:
> https://beaker.engineering.redhat.com/recipes/98483

The test was executed on RHEL5-Server-U6, with
kernel-2.6.18-238.el5.i686
kexec-tools-1.102pre-126.el5.i386

Looks like it is not reproducible anymore.