Bug 475987
Summary: kdump: ahci.ko module loading causes softlockup warnings on xw8800 and xw6800

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 5 |
| Component | kernel |
| Version | 5.3 |
| Status | CLOSED CURRENTRELEASE |
| Severity | medium |
| Priority | low |
| Reporter | Prarit Bhargava <prarit> |
| Assignee | Neil Horman <nhorman> |
| QA Contact | Red Hat Kernel QE team <kernel-qe> |
| CC | dchapman, dzickus, jgarzik, nhorman, qcai |
| Keywords | Reopened |
| Target Milestone | rc |
| Hardware | All |
| OS | Linux |
| Doc Type | Bug Fix |
| Last Closed | 2009-06-18 14:30:40 UTC |
| Bug Depends On | 475418 |
Description
Prarit Bhargava
2008-12-11 13:26:17 UTC
I do not see the soft lockups on an xw8800 with BIOS 0.34, kernel 2.6.18-126.el5, and the fix for bug 475843.

*** This bug has been marked as a duplicate of bug 475843 ***

Still seeing these on the xw8800 in RHTS. Evidently there is a difference between that system and the one David has. David, do you know of a way to check the BIOS version from the OS? Since we only have a serial console on the box, we can't easily see the BIOS version that gets reported at power-up.

The xw8800 in the lab appears to have the same BIOS revision as David's: Version: 786G5 v00.34. Since this is not resolved by the kdump fix for BZ 475843, I am un-duping it.

Created attachment 328790 [details]: patch to port reserve-regions patch to x86

Prarit, I just noted that the patch in the other BZ was for 64-bit, and the kernel you reported this problem on was 32-bit. This patch is the 32-bit version of that kexec patch (which I wrote initially for BZ 473404). Please give it a try and let me know if it fixes the issue. Thanks!

Hey Neil, I just retested on an i386 RHEL 5 box and I no longer see the issue above. However :(, I now have a new softlockup warning:

```
=======================
BUG: soft lockup - CPU#0 stuck for 10s! [exe:499]
Pid: 499, comm: exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
EFLAGS: 00000246 Tainted: G (2.6.18.4 #1)
EAX: 0000000c EBX: c12e7480 ECX: c8eabc40 EDX: c8c9fdd4
ESI: c8eabc40 EDI: 0000000c EBP: 00000000
DS: 007b ES: 007b
CR0: 8005003b CR2: b7f81000 CR3: 01310000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c10249d5>] vprintk+0x2b4/0x2e8
 [<c101e3cd>] try_to_wake_up+0x3e6/0x3f0
 [<c101f7ce>] __cond_resched+0x16/0x34
 [<c103ad61>] __link_module+0x0/0x10
 [<c101f7ce>] __cond_resched+0x16/0x34
 [<c120e870>] cond_resched+0x2a/0x31
 [<c1024a21>] printk+0x18/0x8e
 [<c9858014>] fusion_init+0x14/0xd1 [mptbase]
 [<c103d05c>] sys_init_module+0x1ae0/0x1c95
 [<c10ee330>] pci_bus_read_config_byte+0x0/0x58
 [<c10094c6>] sys_mmap2+0x99/0xa3
 [<c1004f17>] syscall_call+0x7/0xb
=======================
```

The above was seen on RHTS. Test information:

```
HOSTNAME=hp-xw8800-01.rhts.bos.redhat.com
JOBID=41756
RECIPEID=144548
LAB_SERVER=lab.rhts.bos.redhat.com
RESULT_SERVER=rhts.redhat.com
DISTRO=RHEL5.3-Server-20090106.0
ARCHITECTURE=i386
```

This was running the latest kernel plus kexec-tools built with the patch in comment #5.

P.

My best guess at the above is that this might be a new, different occurrence of the same type of bug. In both the old and new cases we seem to be stuck in do_IRQ, most likely handling an interrupt which is never getting cleared. Since that means we loop forever in do_IRQ, we never touch the softlockup watchdog, and we trip the bug. Previously, it seems this was occurring due to our inability to deal with the USB bus AHCI interface, and it seems that adding in those reserved sections has fixed that.
In the new problem it appears that we are getting the same type of failure, but during the handling of a different interrupt (possibly related to the HBA driven by the mptbase driver). Prarit, can you do two things for me, please:

1) Verify that with the patch I provided, all the e820 reserved and ACPI sections are getting properly mapped. You should be able to simply compare the user e820 map to the BIOS e820 map.

2) Configure kdump to dump via scp or NFS, and manually modify the initramfs to not load the mptbase module. If this solves the above softlockup, that should be a good indicator that this is a separate problem we can deal with in a new bug, using this bug to commit the kexec changes. Thanks!

Neil, here you go. The odd thing is that when I configure an scp dump, the crash kernel hangs at:

```
Security Framework v1.0.0 initialized
SELinux: Initializing.
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
monitor/mwait feature present.
using mwait in idle threads.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 14k freed
ACPI: Core revision 20060707
CPU0: Intel(R) Xeon(R) CPU X5560 @ 2.80GHz stepping 04
Total of 1 processors activated (5611.88 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Brought up 1 CPUs
checking if image is initramfs... it is
```

Here are the results for the e820 map.

Bare-metal kernel:

```
Linux version 2.6.18.4 (root.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jan 14 07:37:01 EST 2009
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000095800 (usable)
 BIOS-e820: 0000000000095800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000befb5800 (usable)
 BIOS-e820: 00000000befb5800 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
 BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
```

Crash kernel:

```
hp-xw8800-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18.4 (root.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jan 14 07:37:01 EST 2009
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 0000000000095800 (usable)
 BIOS-e820: 0000000000095800 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000befb5800 (usable)
 BIOS-e820: 00000000befb5800 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
 BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000008000000 - 0000000008f5b000 (usable)
<snip>
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200 irqpoll maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=15724K@131072K elfcorehdr=146796K
```

Hmm, odd; it looks like the additional maps from the command line aren't there in the kdump kernel. I'll look into it further.

(In reply to comment #10)
> hmm, odd, it looks like the additional maps from the command line aren't there
> in the kdump kernel. I'll look into it further

Ahh, did we ever make those changes to the i386-specific code in kdump?
I am guessing that's what the problem is, since the code for this is in arch-specific code.

Yeah, that's what the patch I attached here does (or at least did for me). It looks like Prarit may not have applied it during his testing. I just applied it and tried, and it definitely gets farther. Now it's hanging during the initramfs setup, so I'm digging into this next. Gotta love the xw series :)

Grr, it would appear that this particular xw system doesn't list the first 640k as being RAM; it lists the first 958k as being RAM, and the remaining 100k as being 'reserved'. This is problematic, because kexec wants to assume that the first 640k is RAM (as it should be on all x86 systems, I think). I'll fix up kexec-tools to properly map that and see if it gets me past the initramfs setup.

OK, so I've managed to get the system booting past where it was, by adding in all the missing reserved areas. I'll attach the cleaned-up patch shortly. Unfortunately, it's still hanging as we try to decompress the initramfs, and I have no idea why. I think it may be a separate problem, as we no longer see any soft lockup notification. I think we should probably use this bug to commit the kexec patch to map the reserved areas, and open a new bug to track the hang on decompressing the initramfs.

Created attachment 329243 [details]: cleaned-up version of patch
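Neil's earlier request, comparing the user-defined e820 map against the BIOS-provided one, can be scripted rather than eyeballed. A minimal sketch, using abridged sample lines modeled on the logs in this bug (on a live system the input would come from `dmesg`, and the exact line format varies by kernel, so treat the prefixes here as assumptions):

```shell
#!/bin/sh
# Sketch: diff the reserved regions of the BIOS-provided e820 map against
# the user-defined map passed via memmap=. Sample data is abridged from
# this bug's console logs; on a live system these lines come from dmesg.
cat > /tmp/bios_map <<'EOF'
BIOS-e820: 0000000000095800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 00000000befb5800 - 00000000c0000000 (reserved)
EOF
cat > /tmp/user_map <<'EOF'
user: 0000000000095800 - 00000000000a0000 (reserved)
user: 00000000befb5800 - 00000000c0000000 (reserved)
EOF
# Strip the prefixes so only "start - end (type)" remains, then report
# BIOS regions that the user map is missing (comm needs sorted input).
sed 's/^BIOS-e820: //' /tmp/bios_map | sort > /tmp/bios_ranges
sed 's/^user: //' /tmp/user_map | sort > /tmp/user_ranges
echo "Reserved regions missing from the user map:"
comm -23 /tmp/bios_ranges /tmp/user_ranges
```

Any region the BIOS marks reserved but the user map omits is a candidate for the kind of unmapped-region interrupt trouble suspected in this bug.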
Doug, do you have any thoughts regarding this problem? With my patch above I definitely get farther into the booting process, but then the system in question hangs during the setup of the initramfs (prior to us executing anything in it). Does that ring any bells from prior xw-series systems that you've debugged kdump on?

The hang at:

```
checking if image is initramfs... it is
```

was caused by a typo in the crashkernel argument line: it was given as 16M@128M, which gave the kdump kernel only 16M to run in, which isn't enough to expand the initrd. Once I fixed this I get the lockup that Prarit mentioned in comment #6. It looks like a race between two printks, though I don't know how that is possible. I did find that a reliable workaround was to remove maxcpus=1 from the kdump config, so it boots the kdump kernel with all the CPUs. Also, I found that in rare conditions it works OK even with maxcpus=1, so this is certainly a race condition. From looking at the initial stack trace from the ahci module, it looks like that was actually hitting the same situation, just from another driver. I will continue to dig.

It appears that this is happening during the second printk of the first module that is loaded. The system was loading mptbase.ko first; I added a bunch of printk("hello world\n") statements in fusion_init, and when it hung (which is most of the time, but not always) we always saw the first printk but not the second. I then hacked the kdump initrd to change the module loading order. If I load scsi_mod.ko first, I get the lockup with a new path:

```
Creating initial device nodes
Loading scsi_mod.ko module
SCSI subsystem initialized
BUG: soft lockup - CPU#0 stuck for 10s! [exe:508]
Pid: 508, comm: exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
EFLAGS: 00000246 Tainted: G (2.6.18-prep #1)
EAX: 00000001 EBX: c12e6f00 ECX: d0eefba0 EDX: d0d1edd8
ESI: d0eefba0 EDI: 00000001 EBP: 00000000
DS: 007b ES: 007b
CR0: 8005003b CR2: b7f54000 CR3: 01310000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c10249d5>] vprintk+0x2b4/0x2e8
 [<c1309d50>] init_sel_fs+0x2d/0x52
 [<c101d7fe>] __wake_up+0x2a/0x3d
 [<c1024a21>] printk+0x18/0x8e
 [<d187d08a>] init_scsi+0x8a/0xbb [scsi_mod]
 [<c103d05c>] sys_init_module+0x1ae0/0x1c95
 [<c10dc4ab>] blk_init_queue+0x0/0x8
 [<c10094c6>] sys_mmap2+0x99/0xa3
 [<c1004f17>] syscall_call+0x7/0xb
```

In the process of experimenting I hit a different stack trace, which I think points us to the serial driver (which seems logical based on the behavior):

```
Pid: 510, comm: exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
EFLAGS: 00000246 Tainted: G (2.6.18-prep #1)
EAX: 00000001 EBX: c12e6f00 ECX: d0d09ba0 EDX: c1354f70
ESI: d0d09ba0 EDI: 00000001 EBP: 00000000
DS: 007b ES: 007b
CR0: 80050033 CR2: 081b1230 CR3: 01312000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c10074e5>] do_IRQ+0xb0/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c120ff60>] _spin_unlock_irqrestore+0x8/0x9
 [<c11509fc>] serial8250_interrupt+0x8b/0xdd
 [<c104d225>] handle_IRQ_event+0x45/0x8c
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c105cacd>] __pagevec_lru_add_active+0x89/0x9e
 [<c105cd86>] __lru_add_drain+0x2f/0x76
 [<c10641bd>] exit_mmap+0x1b/0xdf
 [<c102207b>] mmput+0x25/0x69
 [<c1026a2d>] do_exit+0x20c/0x7a2
 [<c1313e40>] inet_init+0x79/0x306
 [<c1027039>] sys_exit_group+0x0/0xd
 [<c1004f17>] syscall_call+0x7/0xb
```

Based on Doug's findings above, I just tried a kdump with the console=serial line removed from the command line, and it was successful. This implies that we have a race in the serial console driver.

I'm starting a series of tests. Instead of removing the serial console from the equation, I'm leaving it in place and modifying the initramfs to change the value in /proc/sys/kernel/printk, to vary the console loglevel threshold. The hope is to get a better idea of which message level is triggering this crash; that should narrow down the set of printk call sites that might be triggering this bug.

This is getting rather irritating. I just tried two dumps with a loglevel of 7, and they both worked.

(In reply to comment #22)
> This is getting rather irritating. I just tried 2 dumps with a loglevel of 7,
> and they both worked

It still reproduces easily for me. If you tried it just this morning, then I had the system set up to use VGA for the kdump console. I just now switched it back to using serial, and it failed on the first try.

Dang! It did for me too. What's worse, I just reduced the console log level again, and it panicked on me! Grr...

Scratch my last comment; I'd implemented the log level change wrong. At level 6 it seems to work consistently, although I can't find any significant difference between log level 6 and log level 7 in the console output.

Doug and Neil, see BZ 455422 for the panic-on-softlockup option, which may or may not help debug this...

P.

Prarit, are you sure you have the right BZ there? That bug is for some power-management formatting changes.

Oops, I meant 445422: https://bugzilla.redhat.com/show_bug.cgi?id=445422 Sorry about that, nhorman.

P.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Updating PM score.
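The loglevel experiments described above work by writing to /proc/sys/kernel/printk. A minimal sketch of that knob, assuming a Linux system with /proc mounted and root privileges (the initramfs hack would simply run the echo before loading modules; the "7 4 1 7" value is an example, not this box's actual setting):

```shell
#!/bin/sh
# /proc/sys/kernel/printk holds four values: current console loglevel,
# default message level, minimum console loglevel, and boot-time default.
# Lowering the first value suppresses console output below that severity.
PRINTK=/proc/sys/kernel/printk
cat "$PRINTK"                   # e.g. "7 4 1 7" before the change
if [ -w "$PRINTK" ]; then
    echo "6 4 1 7" > "$PRINTK"  # drop console loglevel from 7 (debug) to 6 (info)
    cat "$PRINTK"
else
    echo "not root: leaving $PRINTK unchanged"
fi
```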
Prarit, using the -133.el5 kernel (32-bit) and kexec-tools-1.102-670.el5, I've been able to get a few dumps in a row here. Do you think this problem has been fixed?

(In reply to comment #31)
> prarit, Using the -133.el5 kernel (32 bit) and kexec-tools-1.102-670.el5, I've
> been able to get a few dumps in a row here, Do you think this problem has been
> fixed?

Dumps -- actual dumps? i.e., the system is crashing, not just spitting out softlockup warnings?

P.

I mean kdump is working as designed: I crash the system and get a vmcore image saved to the location I specify, with no softlockup warnings or anything.

Ping Prarit, have you had a chance to retry this recently?

Never mind; I just went to reserve a system to recreate this issue, and the xw8800 in the lab has been removed. As we no longer have access to the hardware this bug was filed against, and given that I'm reasonably certain it was fixed with BZ 475843, I think we should close this as CURRENTRELEASE and reopen if someone else reports it.
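As a footnote to the crashkernel typo Doug found earlier in the thread: the argument format is size@offset, so 16M@128M reserves only 16 MB starting at the 128 MB mark. A rough sanity-check sketch; the sample command line below is hypothetical (on a live system it would come from /proc/cmdline), and the 64 MB floor is an illustrative threshold, not an official minimum:

```shell
#!/bin/sh
# Parse crashkernel=size@offset from a kernel command line and warn if
# the reservation looks too small to unpack a kdump initrd. The sample
# cmdline is hypothetical; read /proc/cmdline on a live system.
cmdline='ro root=/dev/VolGroup00/LogVol00 crashkernel=16M@128M console=ttyS0,115200'
ck=$(echo "$cmdline" | tr ' ' '\n' | sed -n 's/^crashkernel=//p')
size=${ck%@*}   # e.g. "16M"
mb=${size%M}    # numeric megabytes
echo "crashkernel reservation: $ck (${mb} MB)"
if [ "$mb" -lt 64 ]; then
    echo "warning: ${mb} MB is likely too small for the kdump initrd"
fi
```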