Bug 475987

Summary: kdump: ahci.ko module loading causes softlockup warnings on xw8800 and xw6800
Product: Red Hat Enterprise Linux 5 Reporter: Prarit Bhargava <prarit>
Component: kernelAssignee: Neil Horman <nhorman>
Status: CLOSED CURRENTRELEASE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium Docs Contact:
Priority: low    
Version: 5.3CC: dchapman, dzickus, jgarzik, nhorman, qcai
Target Milestone: rcKeywords: Reopened
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-06-18 14:30:40 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 475418    
Bug Blocks:    
Attachments:
Description Flags
patch to port reserve regions patch to x86
none
cleaned up version of patch none

Description Prarit Bhargava 2008-12-11 13:26:17 UTC
Description of problem:

When kdumping, softlockup warnings are output to the screen when the ahci module is loaded.

Version-Release number of selected component (if applicable): -125.el5 + patch from BZ 475418


How reproducible: 100%


Steps to Reproduce:
1. Boot kernel with -125.el5 + patch from BZ 475418

  
Actual results:

BUG: soft lockup - CPU#0 stuck for 10s! [exe:449]

Pid: 449, comm:                  exe
EIP: 0060:[<c104d1e9>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
 EFLAGS: 00000246    Tainted: G       (2.6.18.4 #2)
EAX: 0000000c EBX: c12e7480 ECX: c8e5b2c0 EDX: c8e43c7c
ESI: c8e5b2c0 EDI: 0000000c EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: b7fc5000 CR3: 01311000 CR4: 000006d0
 [<c104d2c0>] __do_IRQ+0x84/0xd6
 [<c104d23c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c102499d>] vprintk+0x2b4/0x2e8
 [<c101db0c>] __activate_task+0x4a/0x59
 [<c101881d>] io_apic_set_pci_routing+0x1b9/0x1c3
 [<c10249e9>] printk+0x18/0x8e
 [<c1123669>] acpi_pci_irq_enable+0x1a1/0x1ae
 [<c11535a3>] __driver_attach+0x0/0x6b
 [<c11aa72e>] pcibios_enable_device+0x14/0x16
 [<c10f0acc>] pci_enable_device_bars+0x21/0x2d
 [<c10f0af1>] pci_enable_device+0x19/0x37
 [<c10f0b6f>] pcim_enable_device+0x60/0x72
 [<c984617c>] ahci_init_one+0x6b/0x80e [ahci]
 [<c1031401>] call_usermodehelper_keys+0xc3/0xcf
 [<c103140d>] __call_usermodehelper+0x0/0x43
 [<c10f26d4>] pci_match_device+0x10/0xac
 [<c11535a3>] __driver_attach+0x0/0x6b
 [<c10f27bc>] pci_device_probe+0x36/0x57
 [<c11534f6>] driver_probe_device+0x42/0x92
 [<c11535e7>] __driver_attach+0x44/0x6b
 [<c1152f21>] bus_for_each_dev+0x37/0x59
 [<c1153441>] driver_attach+0x11/0x13
 [<c11535a3>] __driver_attach+0x0/0x6b
 [<c1152c29>] bus_add_driver+0x64/0xfd
 [<c10f28e1>] __pci_register_driver+0x3e/0x58
 [<c103d02c>] sys_init_module+0x1ae0/0x1c95
 [<c10094c6>] sys_mmap2+0x99/0xa3
 [<c1004f17>] syscall_call+0x7/0xb
 =======================

Expected results:

System should kdump.

Comment 1 David Aquilina 2008-12-12 20:58:49 UTC
I do not see the soft lockups on an xw8800 with BIOS 0.34, kernel 2.6.18-126.el5, and the fix for bug 475843

Comment 2 Neil Horman 2008-12-15 11:44:23 UTC

*** This bug has been marked as a duplicate of bug 475843 ***

Comment 3 Doug Chapman 2008-12-16 15:55:06 UTC
Still seeing these on the xw8800 in rhts.  Evidently there is a difference between that system and the one David has.

David, do you know if there is a way to check the BIOS version from the OS?  Since we only have serial console on the box we can't easily see the bios version that gets reported on power up.

Comment 4 Doug Chapman 2008-12-16 16:50:12 UTC
the xw8800 in the lab appears to have the same bios rev as what David's has:
Version: 786G5 v00.34

since this is not resolved by the kdump fix for BZ 475843 I am un-duping it.

Comment 5 Neil Horman 2009-01-12 21:12:35 UTC
Created attachment 328790 [details]
patch to port reserve regions patch to x86

Prarit, just noted that the patch in the other bz was for 64 bit, and this kernel that you reported this problem on was 32 bit.  This patch is the 32 bit version of that kexec-patch (which I wrote initially for bz 473404).  Please give it a try and let me know if it fixes the issue.  Thanks!

Comment 6 Prarit Bhargava 2009-01-13 15:05:59 UTC
Hey Neil, just retested on an i386 RHEL5 box and I no longer see the issue above.

However :(,  I now have a new softlockup warning:

 =======================
BUG: soft lockup - CPU#0 stuck for 10s! [exe:499]

Pid: 499, comm:                  exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
 EFLAGS: 00000246    Tainted: G       (2.6.18.4 #1)
EAX: 0000000c EBX: c12e7480 ECX: c8eabc40 EDX: c8c9fdd4
ESI: c8eabc40 EDI: 0000000c EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f81000 CR3: 01310000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c10249d5>] vprintk+0x2b4/0x2e8
 [<c101e3cd>] try_to_wake_up+0x3e6/0x3f0
 [<c101f7ce>] __cond_resched+0x16/0x34
 [<c103ad61>] __link_module+0x0/0x10
 [<c101f7ce>] __cond_resched+0x16/0x34
 [<c120e870>] cond_resched+0x2a/0x31
 [<c1024a21>] printk+0x18/0x8e
 [<c9858014>] fusion_init+0x14/0xd1 [mptbase]
 [<c103d05c>] sys_init_module+0x1ae0/0x1c95
 [<c10ee330>] pci_bus_read_config_byte+0x0/0x58
 [<c10094c6>] sys_mmap2+0x99/0xa3
 [<c1004f17>] syscall_call+0x7/0xb
 =======================

Comment 7 Prarit Bhargava 2009-01-13 15:08:47 UTC
The above was seen on

                                                                      
      RHTS Test information:                                          
                         HOSTNAME=hp-xw8800-01.rhts.bos.redhat.com                           
                            JOBID=41756                              
                         RECIPEID=144548                           
                       LAB_SERVER=lab.rhts.bos.redhat.com                         
                    RESULT_SERVER=rhts.redhat.com                      
                           DISTRO=RHEL5.3-Server-20090106.0                             
                     ARCHITECTURE=i386  

running the latest kernel + kexec-tools built with the patch in comment #5.

P.

Comment 8 Neil Horman 2009-01-13 18:08:59 UTC
best guess at the above is that this might be a new different occurance of the same type of bug.  In both the old and new cases we seem to be stuck in do_IRQ, most likely handling an interrupt which is never getting cleared.  Since that means we will forever loop in do_IRQ, we never touch the softlockup watchdog, and we trip the bug.

Previously, it seems that this was occuring due to our inability to deal with the usb bus ahci interface, and it seems that adding in those reserved sections has fixed that.  In the new problem it appears that we are getting the same type of problem, but during the handling of a different interrupt (possibly related to the HBA driven by the mptbase driver)

Prarit, can you do 2 things for me please:
1) verify that with the patch I provided, all the e820 reserved and ACPI sections are getting properly mapped?  You should be able to simply compare the user e820 map to the bios e820 map

2) Configure kdump to dump via scp or nfs, and manually modify the initramfs to not load the mptbase module?  If this solves the above softlockup, that should be a good indicator that this is a spearate problem that we can deal with in a new bug, using this bug to commit the kexec changes.

Thanks!

Comment 9 Prarit Bhargava 2009-01-14 13:17:14 UTC
Neil, here you go.  The odd thing is that when I configure to do a scp dump, the crash kernel hangs at

Security Framework v1.0.0 initialized
SELinux:  Initializing.
selinux_register_security:  Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
monitor/mwait feature present.
using mwait in idle threads.
CPU: L1 I cache: 32K, L1 D cache: 32K
CPU: L2 cache: 256K
CPU: L3 cache: 8192K
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Checking 'hlt' instruction... OK.
SMP alternatives: switching to UP code
Freeing SMP alternatives: 14k freed
ACPI: Core revision 20060707
CPU0: Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz stepping 04
Total of 1 processors activated (5611.88 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
Brought up 1 CPUs
checking if image is initramfs... it is


Here are the results for the E820 map:

Bare-metal kernel:

Linux version 2.6.18.4 (root.bos.redhat.com) (gcc version 4.1.
2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jan 14 07:37:01 EST 2009
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 0000000000095800 (usable)
 BIOS-e820: 0000000000095800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000befb5800 (usable)
 BIOS-e820: 00000000befb5800 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
 BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)

Crash kernel:

hp-xw8800-01.rhts.bos.redhat.com login: SysRq : Trigger a crashdump
Linux version 2.6.18.4 (root.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Wed Jan 14 07:37:01 EST 2009
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 0000000000095800 (usable)
 BIOS-e820: 0000000000095800 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000befb5800 (usable)
 BIOS-e820: 00000000befb5800 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fed40000 (reserved)
 BIOS-e820: 00000000fed45000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000008000000 - 0000000008f5b000 (usable)
<snip>
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200  irqpoll maxcpus=1 reset_devices  memmap=exactmap memmap=640K@0K memmap=15724K@131072K elfcorehdr=146796K

Comment 10 Neil Horman 2009-01-14 20:14:26 UTC
hmm, odd, it looks like the additional maps from the command line aren't there in the kdump kernel.  I'll look into it further

Comment 11 Doug Chapman 2009-01-14 22:52:26 UTC
(In reply to comment #10)
> hmm, odd, it looks like the additional maps from the command line aren't there
> in the kdump kernel.  I'll look into it further

Ahh, did we ever make those changes to the i386 specific code in kdump?  I am guessing that's what the problem is since the code for this is in arch specific code.

Comment 12 Neil Horman 2009-01-15 01:41:44 UTC
Yeah, thats what the patch I attached here does (or at least did for me).  It looks like prait may have not applied it during his testing.  I just applied it and tried, and it definately gets farter.  now its hanging during the initramfs setup, so I'm digging into this next.  Gotta love the xw series :)

Comment 13 Neil Horman 2009-01-15 19:04:16 UTC
grr, it would appear that this particular xw system doesn't list the first 640k as being ram, it lists the first 958k as being ram, and the remaining 100k as being 'reserved'.  This is problematic, because kexec wants to assume that the first 640k is ram (as it should be on all x86 systems I think).  I'll fix up kexec tools to properly map that and see if it gets me past the initramfs setup.

Comment 14 Neil Horman 2009-01-16 17:46:55 UTC
ok, So I've managed to get the system booting past where it was.  By adding in all the missing reserved areas.  I'll attach the cleaned up patch shortly.  Unfortunately, its still hanging as we try to decompress the initramfs.  I have no idea why.  I think it may be a separate problem as we no longer see any soft lockup notification.  I think we should probably use this bug to commit the kexec patch to map the reserved areas, and open a new bug to track the hang on decompressing the initramfs.

Comment 15 Neil Horman 2009-01-16 19:05:53 UTC
Created attachment 329243 [details]
cleaned up version of patch

Comment 16 Neil Horman 2009-01-19 18:44:50 UTC
doug, do you have any thoughts regarding this problem?  With my patch above I definately get farther into the booting process, but then the system in question hangs duing the setup of the initramfs (prior to us executing anything in it). Does that ring any bells with you on prior xw series systems that youve debugged kdump on?

Comment 17 Doug Chapman 2009-01-20 21:01:34 UTC
The hang at:
checking if image is initramfs... it is

was caused by a typo in the crashkernel arg line, it was given as 16M@128M which gave the kdump kernel only 16M to run in which isn't enough to expand the initrd.

once I fixed this I get the lockup that prarit mentioned in comment #6.  It looks like a race between two printf's however I don't know how that is possible.  I did find that a reliable workaround was to remove the maxcpus=1 from the kdump config so it boots the kdump kernel with all the cpus.  Also, I found that in rare conditions it works OK even with the maxcpus=1 so this is certainly a race condition.

From looking at the initial stack trace from the ahci module it looks like that was actually hitting the same situation, just from another driver.

I will continue to dig.

Comment 18 Doug Chapman 2009-01-21 16:07:07 UTC
It appears that this is happening during the second printk of the first module that is loaded.

The system was loading mptbase.ko first, I did a bunch of printk("hello world\n") statements in fusion_init and if it hung (which is most of the time but not all) we always saw the first printk but not the second.

I then hacked the kdump initrd to change the module loading order.  If I load scsi_mod.ko first I get the lockup with a new path:

Creating initial device nodes
Loading scsi_mod.ko module
SCSI subsystem initialized
BUG: soft lockup - CPU#0 stuck for 10s! [exe:508]

Pid: 508, comm:                  exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
 EFLAGS: 00000246    Tainted: G       (2.6.18-prep #1)
EAX: 00000001 EBX: c12e6f00 ECX: d0eefba0 EDX: d0d1edd8
ESI: d0eefba0 EDI: 00000001 EBP: 00000000 DS: 007b ES: 007b
CR0: 8005003b CR2: b7f54000 CR3: 01310000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c10249d5>] vprintk+0x2b4/0x2e8
 [<c1309d50>] init_sel_fs+0x2d/0x52
 [<c101d7fe>] __wake_up+0x2a/0x3d
 [<c1024a21>] printk+0x18/0x8e
 [<d187d08a>] init_scsi+0x8a/0xbb [scsi_mod]
 [<c103d05c>] sys_init_module+0x1ae0/0x1c95
 [<c10dc4ab>] blk_init_queue+0x0/0x8
 [<c10094c6>] sys_mmap2+0x99/0xa3
 [<c1004f17>] syscall_call+0x7/0xb

Comment 19 Doug Chapman 2009-01-21 18:20:17 UTC
In the process of experimenting I hit a different stack track which I think points us to the serial driver (which seems logicial based on the behavior):

Pid: 510, comm:                  exe
EIP: 0060:[<c104d219>] CPU: 0
EIP is at handle_IRQ_event+0x39/0x8c
 EFLAGS: 00000246    Tainted: G       (2.6.18-prep #1)
EAX: 00000001 EBX: c12e6f00 ECX: d0d09ba0 EDX: c1354f70
ESI: d0d09ba0 EDI: 00000001 EBP: 00000000 DS: 007b ES: 007b
CR0: 80050033 CR2: 081b1230 CR3: 01312000 CR4: 000006d0
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c10074e5>] do_IRQ+0xb0/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c120ff60>] _spin_unlock_irqrestore+0x8/0x9
 [<c11509fc>] serial8250_interrupt+0x8b/0xdd
 [<c104d225>] handle_IRQ_event+0x45/0x8c
 [<c104d2f0>] __do_IRQ+0x84/0xd6
 [<c104d26c>] __do_IRQ+0x0/0xd6
 [<c10074ce>] do_IRQ+0x99/0xc3
 [<c1005946>] common_interrupt+0x1a/0x20
 [<c105cacd>] __pagevec_lru_add_active+0x89/0x9e
 [<c105cd86>] __lru_add_drain+0x2f/0x76
 [<c10641bd>] exit_mmap+0x1b/0xdf
 [<c102207b>] mmput+0x25/0x69
 [<c1026a2d>] do_exit+0x20c/0x7a2
 [<c1313e40>] inet_init+0x79/0x306
 [<c1027039>] sys_exit_group+0x0/0xd
 [<c1004f17>] syscall_call+0x7/0xb

Comment 20 Neil Horman 2009-01-21 18:40:21 UTC
based on dougs findings above, I just tried a kdump with the console=serial line removed from the command line, and it was successful.  This implies that we have a race in the serial console driver

Comment 21 Neil Horman 2009-01-21 19:16:06 UTC
I'm starting a series of tests.  Instead of removing the serial console from the equation, I'm leaving it in place.  Instead I'm modifying the initramfs to change the value in /proc/sys/kernel/printk, to vary the console loglevel threshold.  The hope is to get a better idea of which message level is triggering this crash.  That should narrow down the set of printk call sites that might be triggering this bug

Comment 22 Neil Horman 2009-01-22 12:24:24 UTC
This is getting rather irritating.  I just tried 2 dumps with a loglevel of 7, adn they both worked

Comment 23 Doug Chapman 2009-01-22 13:20:57 UTC
(In reply to comment #22)
> This is getting rather irritating.  I just tried 2 dumps with a loglevel of 7,
> adn they both worked

Still reproduces easily for me.  If you tried it just this morning then I had the system set up to use VGA for the kdump console.  I just now switched it back to using serial and it failed on the first try.

Comment 24 Neil Horman 2009-01-22 16:10:37 UTC
dang!  It did for me too.  Whats worse, I just reduce the console log level again, and it paniced on me!  grr....

Comment 25 Neil Horman 2009-01-22 21:05:26 UTC
scratch my last comment, I'd implemented the log level change wrong.  at level 6 it seems  to work consistently, although I can't find any significant different between log level 6 and log level 7 in the console output

Comment 26 Prarit Bhargava 2009-01-26 13:47:59 UTC
Doug and Neil,

See BZ 455422 for panic on softlockup option which may/may not help debug this...

P.

Comment 27 Neil Horman 2009-01-26 14:32:57 UTC
prarit, are you sure you have the right bz there?  That bug is for some power management formatting changes

Comment 28 Prarit Bhargava 2009-01-26 14:54:14 UTC
Oops.  I meant 445422.

https://bugzilla.redhat.com/show_bug.cgi?id=445422

Sorry about that nhorman.

P.

Comment 29 RHEL Program Management 2009-01-27 20:41:59 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 30 RHEL Program Management 2009-02-16 15:43:17 UTC
Updating PM score.

Comment 31 Neil Horman 2009-03-27 13:39:30 UTC
prarit, Using the -133.el5 kernel (32 bit) and kexec-tools-1.102-670.el5, I've been able to get a few dumps in a row here, Do you think this problem has been fixed?

Comment 32 Prarit Bhargava 2009-03-31 12:09:59 UTC
(In reply to comment #31)
> prarit, Using the -133.el5 kernel (32 bit) and kexec-tools-1.102-670.el5, I've
> been able to get a few dumps in a row here, Do you think this problem has been
> fixed?  

Dumps -- actual dumps?  ie) the system is crashing, not just spitting out softlockup warnings?

P.

Comment 33 Neil Horman 2009-03-31 13:20:50 UTC
I mean kdump is working as designed.  I crash the system and get a vmcore image saved to the location I specify, no softlockup warnings or anything.

Comment 35 Neil Horman 2009-06-17 01:08:31 UTC
ping prarit, have you had a chance to retry this recently?

Comment 36 Neil Horman 2009-06-18 14:30:40 UTC
nm, I just went to go and try to reserve a system to recreate this issue, and the xw8800 in the lab has been removed.  As we no longer have access to the hardware that this bug was filed against, and given that I'm reasonably certain that this bug was fixed with bz 475843, I think we should close this as CURRENTRELEASE, and reopen if someone else reports it.