Bug 436258 - [5.2][kdump] recursive die() failure did not trigger kdump
Status: CLOSED CANTFIX
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: i386 Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Neil Horman
QA Contact: Martin Jenner
Depends On:
Blocks:
Reported: 2008-03-06 01:53 EST by CAI Qian
Modified: 2008-03-06 16:23 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-03-06 16:23:57 EST

Description CAI Qian 2008-03-06 01:53:13 EST
Description of problem:
When testing one of the LTP kdump crash scenarios, KPTEO (overflow in
tasklet_action) or KPIDO (overflow in do_irq), kdump failed to trigger after the
first kernel Oops:

...
current esp cccccccc does not match saved esp c072afe8
Saved registers for jprobe f89e84a0
Modules linked in: lkdtm(U) autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6
xfrm_nalgo crypto_api dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec
button battery asus_acpi ac lp joydev i6300esb floppy e1000 ide_cd pcspkr
parport_pc serio_raw sg i82875p_edac cdrom edac_mc parport i2c_i801 i2c_core
ata_piix libata aacraid sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    65df:[<f89e62d4>]    Tainted: G      VLI
EFLAGS: 00000001   (2.6.18-83.el5 #1) 
EIP is at lkdtm_handler+0xa3/0xaf [lkdtm]
eax: ffffffff   ebx: ffffffff   ecx: ffffffff   edx: ffffffff
esi: ffffffff   edi: ffffffff   ebp: ffffffff   esp: c042a7e6
ds: ffff   es: 0001   ss: 7f94
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000084
 printing eip:
c060a35c
*pde = eefef067
Oops: 0000 [#1]
SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000084
 printing eip:
c060a35c
*pde = eefef067
Oops: 0000 [#2]
SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000084
 printing eip:
c060a35c
*pde = eefef067
Recursive die() failure, output suppressed
...

Full log:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2070104

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080225.2
kernel-2.6.18-83.el5
kexec-tools-1.102pre-10.el5

How reproducible:
Reproduced it several times on dell-pe700-01.rhts.boston.redhat.com

Steps to Reproduce:
1. Reserved one of the affected machines, configured kdump, and booted the
kernel with "crashkernel=128M@16M nmi_watchdog=1".

2. wget
http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-kdump-20080228.tar.gz; cd
kdump/lib/lkdtm; export USE_SYMBOL_NAME=1; make

3. echo 1 > /proc/sys/kernel/panic_on_oops; insmod lkdtm.ko
cpoint_name=INT_HARDWARE_ENTRY cpoint_type=OVERFLOW cpoint_count=10

  OR

3. echo 1 > /proc/sys/kernel/panic_on_oops; insmod lkdtm.ko
cpoint_name=INT_TASKLET_ENTRY cpoint_type=OVERFLOW cpoint_count=10
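
The two variants of step 3 differ only in the crash point name, so they can be sketched as one small helper. This is a dry-run sketch, not part of the original test suite: the function name lkdtm_cmd is hypothetical, and it only prints the insmod command rather than running it, because loading the module with these parameters is expected to crash the machine.

```shell
# Hypothetical helper: print the insmod command for one lkdtm crash point.
# $1 = crash point name (e.g. INT_HARDWARE_ENTRY or INT_TASKLET_ENTRY)
# $2 = hit count before the overflow is triggered (defaults to 10, as in step 3)
lkdtm_cmd() {
    cpoint_name=$1
    cpoint_count=${2:-10}
    echo "insmod lkdtm.ko cpoint_name=$cpoint_name cpoint_type=OVERFLOW cpoint_count=$cpoint_count"
}

# Dry run: print both variants from step 3 without loading the module.
lkdtm_cmd INT_HARDWARE_ENTRY
lkdtm_cmd INT_TASKLET_ENTRY
```

To actually reproduce, run the printed command as root from kdump/lib/lkdtm on a disposable test machine, after setting panic_on_oops as in step 3.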

  
Actual results:
Machine was dead. 

Expected results:
Capture kernel started and then saved a vmcore.

Additional info:
Neil, do you think the patch from
https://bugzilla.redhat.com/show_bug.cgi?id=346431#c37 will help here?
Comment 1 CAI Qian 2008-03-06 02:39:22 EST
The problem also occurs with the stock RHEL 5.1 kernel, 2.6.18-53.el5.
Comment 2 Neil Horman 2008-03-06 16:23:57 EST
I think we're going to be out of luck with this one.

Looking at the stack traces above, it appears that we're constantly recursing
through the page_fault handler.  Judging by how much output we get before we oops
again, it appears that our regs structure is corrupted.  Without that, we aren't
going to be able to successfully set up a core dump.
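
For spotting the same pattern in other saved console logs, here is a minimal sketch (not from the original report). It counts the "Oops: ... [#N]" headers and the "Recursive die() failure" message in a log file. The function name summarize_oops and the output format are hypothetical, and it assumes the log lines look like the excerpt in the description.

```shell
# Hypothetical helper: summarize oops activity in a saved console log.
# $1 = path to the log file
summarize_oops() {
    log_file=$1
    # Count "Oops: <code> [#N]" headers, one per reported oops.
    oops_count=$(grep -c '^Oops: [0-9a-f]* \[#[0-9][0-9]*]' "$log_file" || true)
    # Count the message the kernel prints when die() nests too deeply.
    recursive_count=$(grep -c -F 'Recursive die() failure' "$log_file" || true)
    echo "oops=$oops_count recursive_die=$recursive_count"
}
```

Run against the excerpt in the description, this should report two numbered oopses followed by one recursive die() message, which matches the repeated faulting described above.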
