Bug 441657 - [5.2][kdump] NMI Watchdog does not trigger kdump on NEC machines
[5.2][kdump] NMI Watchdog does not trigger kdump on NEC machines
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools (Show other bugs)
5.2
i386 Linux
low Severity low
: rc
: ---
Assigned To: Neil Horman
:
: 441662 (view as bug list)
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2008-04-09 07:42 EDT by CAI Qian
Modified: 2008-10-22 08:29 EDT (History)
0 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-10-22 08:29:13 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
normal Kernel boot log (16.40 KB, text/plain)
2008-04-09 21:35 EDT, CAI Qian
no flags Details
patch to remove printks from lkdtm_handler (929 bytes, patch)
2008-04-10 07:19 EDT, Neil Horman
no flags Details | Diff
Different call trace (2.18 KB, text/plain)
2008-04-13 01:08 EDT, Cai Qian
no flags Details

  None (edit)
Description CAI Qian 2008-04-09 07:42:38 EDT
Description of problem:
LTP kdump INT_HARDWARE_ENTRY LOOP test cases failed to trigger kdump on some NEC
machines,

nec-em18.rhts.boston.redhat.com login: BUG: NMI Watchdog detected LOCKUP on
CPU0, eip f8a0e0ed, registers:
Modules linked in: lkdtm(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery
asus_acpi ac parport_pc lp parport e752x_edac edac_mc sg floppy pcspkr ide_cd
cdrom i2c_i801 e1000 i2c_core dm_snapshot dm_zero dm_mirror dm_mod ata_piix
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<f8a0e0ed>]    Tainted: G      VLI
EFLAGS: 00000046   (2.6.18-88.el5 #1) 
EIP is at lkdtm_handler+0x90/0xaf [lkdtm]
eax: 00000010   ebx: c06e7f90   ecx: f8a0e461   edx: 00000000
esi: 0000000f   edi: c06e7fa0   ebp: c061ff43   esp: c074bff8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c074b000 task=c06713c0 task.ti=c06e7000)
Stack: f8a0e166 c04073f4 
Call Trace:
 [<f8a0e166>] jp_do_irq+0x5/0xd [lkdtm]
 [<c04073f4>] do_IRQ+0x93/0xae
 [<c040592e>] common_interrupt+0x1a/0x20
 [<c0403ccf>] mwait_idle+0x25/0x38
 [<c0403c90>] cpu_idle+0x9f/0xb9
 [<c06ec9ee>] start_kernel+0x379/0x380
 =======================
Code: 0f 0b 04 01 2e e4 a0 f8 eb 34 68 4b e4 a0 f8 e8 94 85 a1 c7 c7 05 00 00 00
00 00 00 00 00 eb 1e 68 61 e4 a0 f8 e8 7e 85 a1 c7 59 <eb> fe 68 72 e4 a0 f8 e8
71 85 a1 c7 31 c0 e8 00 ff ff ff 5a a1 

Then, it hung. The interesting thing is that if added the following earlyprintk
line to /etc/sysconfig/kdump, it gave me positive results three times in a row,

earlyprintk=serial,ttyS0,115200

...
 Uncompressing Linux... Ok, booting the kernel.
Linux version 2.6.18-88.el5 (brewbuilder@hs20-bc2-4.build.redhat.com) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:20 EDT 2008
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ff70000 (usable)
 BIOS-e820: 000000003ff70000 - 000000003ff7b000 (ACPI data)
 BIOS-e820: 000000003ff7b000 - 000000003ff80000 (ACPI NVS)
 BIOS-e820: 000000003ff80000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000008f5b000 (usable)
early console enabled
0MB HIGHMEM available.
143MB LOWMEM available.
found SMP MP-table at 000f6f70
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
Using x86 segment limits to approximate NX protection
DMI 2.3 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0x1008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:4 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:4 APIC version 20
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 10000000 (gap: 08f5b000:f70a5000)
Detected 3192.358 MHz processor.
Built 1 zonelists.  Total pages: 36699
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200n8 
nmi_watchdog=1 irqpoll maxcpus=1 reset_devices earlyprintk=serial,ttyS0,115200
memmap=exactmap memmap=640K@0K memmap=130412K@16384K elfcorehdr=146796K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c134b000 soft=c132b000
PID hash table entries: 1024 (order: 10, 4096 bytes)
disabling early console
Linux version 2.6.18-88.el5 (brewbuilder@hs20-bc2-4.build.redhat.com) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-41)) #1 SMP Tue Apr 1 19:01:20 EDT 2008
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000100 - 000000000009b800 (usable)
 BIOS-e820: 000000000009b800 - 00000000000a0000 (reserved)
 BIOS-e820: 0000000000100000 - 000000003ff70000 (usable)
 BIOS-e820: 000000003ff70000 - 000000003ff7b000 (ACPI data)
 BIOS-e820: 000000003ff7b000 - 000000003ff80000 (ACPI NVS)
 BIOS-e820: 000000003ff80000 - 0000000040000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
user-defined physical RAM map:
 user: 0000000000000000 - 00000000000a0000 (usable)
 user: 0000000001000000 - 0000000008f5b000 (usable)
early console enabled
0MB HIGHMEM available.
143MB LOWMEM available.
found SMP MP-table at 000f6f70
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
disabling kdump
Using x86 segment limits to approximate NX protection
DMI 2.3 present.
Using APIC driver default
ACPI: PM-Timer IO Port: 0x1008
ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
Processor #0 15:4 APIC version 20
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
Processor #1 15:4 APIC version 20
WARNING: maxcpus limit of 1 reached. Processor ignored.
ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Enabling APIC mode:  Flat.  Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 10000000 (gap: 08f5b000:f70a5000)
Detected 3192.358 MHz processor.
Built 1 zonelists.  Total pages: 36699
Kernel command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS0,115200n8 
nmi_watchdog=1 irqpoll maxcpus=1 reset_devices earlyprintk=serial,ttyS0,115200
memmap=exactmap memmap=640K@0K memmap=130412K@16384K elfcorehdr=146796K
Misrouted IRQ fixup and polling support enabled
This may significantly impact system performance
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
CPU 0 irqstacks, hard=c134b000 soft=c132b000
PID hash table entries: 1024 (order: 10, 4096 bytes)
disabling early console
Console: colour VGA+ 80x25
...

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080402.0 (i386)
kernel-2.6.18-88.el5
kexec-tools-1.102pre-20.el5

How reproducible:
Always on nec-em18.rhts.boston.redhat.com and probably
nec-em19.rhts.boston.redhat.com.

Steps to Reproduce:
-
Reserved one of the affected machines, and configured kdump and booted the
kernel with "crashkernel=128M@16M nmi_watchdog=1".
-
wget
http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz;
tar zxvf ltp-kdump-20080228.tar.gz; cd kdump; export USE_SYMBOL_NAME=1; make
-
insmod lkdtm.ko cpoint_name=INT_HARDWARE_ENTRY cpoint_type=LOOP cpoint_count=05
Comment 1 Neil Horman 2008-04-09 09:12:33 EDT
Can you provide the log of the production kernel boot?  It would be interesting
to know how the serial console was configured before we tried to execute the
kdump.  I wonder if we're taking a serial interrupt at an in-opportune time
during the nmi  shutdown and thats causing us problems

also, what kexec-tools are you using?  I just want to make sure that we're not
running into something already fixed, like our wierd command line length
problem.  Thanks!
Comment 2 CAI Qian 2008-04-09 21:35:09 EDT
Created attachment 301915 [details]
normal Kernel boot log

In addition, kexec-tools-1.102pre-20.el5 was used.
Comment 3 Neil Horman 2008-04-10 07:19:20 EDT
Created attachment 301964 [details]
patch to remove printks from lkdtm_handler

could you please apply this patch to the lkdtm test code, and try this again? 
The fact that you can move this forward by specifying earlyprintk suggests to
me that jprobes somehow sets the system in such a way that writing to the
console when a serial port is configured  as your console device may have a
window in which we are deadlocking.  In the meantime I'll look into how jprobes
operates, but this will be a good data point to have.  Thanks
Comment 4 Cai Qian 2008-04-12 05:47:25 EDT
The patch above did not solve the problem here. After more testing, it turned
out that "earlyprintk" may just reduce the rate of failures, but does not solve
the problem entirely. Even with earlyprintk, 3 of 9 attempts were still failed.

It looks like it did not even make into purgatory, as added --console-serial to
kexec commandline did not cause "I am in purgatory" be printed out on serial
console,

nec-em18.rhts.boston.redhat.com login: BUG: NMI Watchdog detected LOCKUP on
CPU0, eip f89f90a2, registers:
Modules linked in: lkdtm(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
xfrm_nalgo crypto_api dm_multipath video sbs backlight i2c_ec button battery
asus_acpi ac parport_pc lp parport sg e752x_edac edac_mc ide_cd e1000 pcspkr
floppy cdrom i2c_i801 i2c_core dm_snapshot dm_zero dm_mirror dm_mod ata_piix
libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<f89f90a2>]    Tainted: G      VLI
EFLAGS: 00000097   (2.6.18-88.el5 #1) 
EIP is at lkdtm_handler+0x45/0x59 [lkdtm]
eax: 00000003   ebx: c06e7f90   ecx: c06e7000   edx: c06e7fa0
esi: 0000000f   edi: c06e7fa0   ebp: c061ff43   esp: c074bff8
ds: 007b   es: 007b   ss: 0068
Process swapper (pid: 0, ti=c074b000 task=c06713c0 task.ti=c06e7000)
Stack: f89f9110 c04073f4 
Call Trace:
 [<f89f9110>] jp_do_irq+0x5/0xd [lkdtm]
 [<c04073f4>] do_IRQ+0x93/0xae
 [<c040592e>] common_interrupt+0x1a/0x20
 [<c0403ccf>] mwait_idle+0x25/0x38
 [<c0403c90>] cpu_idle+0x9f/0xb9
 [<c06ec9ee>] start_kernel+0x379/0x380
 =======================
Code: f8 04 77 30 ff 24 85 e0 92 9f f8 68 79 93 9f f8 e8 7b cb a2 c7 0f 0b 00 01
82 93 9f f8 eb 15 c7 05 00 00 00 00 00 00 00 00 eb 09 <eb> fe 31 c0 e8 55 ff ff
ff a1 a4 a0 9f f8 a3 ac a0 9f f8 c3 e8 
Comment 5 Neil Horman 2008-04-12 15:30:33 EDT
is this trace taken when the printks are removed from the lktdm_handler function
as per my patch?  That will help me figure out more precisely where in
lktdm_handler we're dying now
Comment 6 Cai Qian 2008-04-13 01:01:27 EDT
Yes.
Comment 7 Cai Qian 2008-04-13 01:08:25 EDT
Created attachment 302248 [details]
Different call trace

Not sure if that would help, but there once I got a slightly different call
trace. It was also failed to bring up the second kernel, and with your patch
applied to lkdtm.
Comment 8 Neil Horman 2008-04-14 10:24:55 EDT
K, Will, do you know off the top of your head if there are any caveats or other
gotchas regarding crashing a system from within a jbprobe function? 
Specifically any recursive deadlock that might occur if you're planning on
issueing a printk or other write to the serial console from within such a
function when the jprobe is attached to the do_IRQ routine?
Comment 9 Neil Horman 2008-04-15 07:08:52 EDT
*** Bug 441662 has been marked as a duplicate of this bug. ***
Comment 10 CAI Qian 2008-10-22 08:29:13 EDT
I'll close this out, as using jprobe() to trigger artificial crashes probably not a good way to test Kdump. I'll create a new Kernel module to test those scenarios and open new BZs for any issue found.

Note You need to log in before you can comment on or make changes to this bug.