Description of problem:
System PANIC'd while booting the kexec kernel:
"Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"

Version-Release number of selected component (if applicable):
2.6.32-131.0.15.el6.x86_64

How reproducible:
Issue reproduced here:
[] https://beaker.engineering.redhat.com/jobs/101896
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/1018/101896/208470//console.log

Steps to Reproduce:
1. Install system with RHEL6.1 (distro RHEL6.1-20110510.1)
2. Configure system for kdump
3. Crash system with #echo c > /proc/sysrq-trigger

Actual results:
https://beaker.engineering.redhat.com/jobs/99230
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/992/99230/202012//console.log

<-SNIP->
Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814dac28>] ? panic+0x78/0x143
[<ffffffff810d649d>] ? watchdog_overflow_callback+0xcd/0xd0
[<ffffffff81108b56>] ? __perf_event_overflow+0x116/0x290
[<ffffffff81109149>] ? perf_event_overflow+0x19/0x20
[<ffffffff8101cde4>] ? p4_pmu_handle_irq+0x224/0x2f0
[<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430
[<ffffffff814df1b8>] ? perf_event_nmi_handler+0x58/0xe0
[<ffffffff814e0cf5>] ? notifier_call_chain+0x55/0x80
[<ffffffff814e0d5a>] ? atomic_notifier_call_chain+0x1a/0x20
[<ffffffff810940fe>] ? notify_die+0x2e/0x30
[<ffffffff814de963>] ? do_nmi+0x173/0x2b0
[<ffffffff814de270>] ? nmi+0x20/0x30
[<ffffffff810141a7>] ? mwait_idle+0x77/0xd0
<<EOE>> [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110
[<ffffffff814c376a>] ? rest_init+0x7a/0x80
[<ffffffff81bbdf28>] ? start_kernel+0x41d/0x429
[<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
[<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
BUG: scheduling while atomic: swapper/0/0x14010000
Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff81055cf6>] ? __schedule_bug+0x66/0x70
[<ffffffff814db8c2>] ? thread_return+0x5d9/0x777
[<ffffffff814dacd0>] ? panic+0x120/0x143
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff8105faba>] ? __cond_resched+0x2a/0x40
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff814dbbb0>] ? _cond_resched+0x30/0x40
[<ffffffff8100dff6>] ? is_valid_bugaddr+0x16/0x40
[<ffffffff812624af>] ? report_bug+0x1f/0xc0
[<ffffffff8100f31f>] ? die+0x7f/0x90
[<ffffffff814de544>] ? do_trap+0xc4/0x160
[<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430
[<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff814de80c>] ? do_nmi+0x1c/0x2b0
[<ffffffff814de270>] ? nmi+0x20/0x30
[<ffffffff814dacd0>] ? panic+0x120/0x143
<<EOE>>
------------[ cut here ]------------
kernel BUG at arch/x86/kernel/traps.c:547!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/uevent_helper
CPU 0
Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod
Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant DL360 G4p
RIP: 0010:[<ffffffff814dea1a>] [<ffffffff814dea1a>] do_nmi+0x22a/0x2b0
RSP: 0018:ffff880002207f28 EFLAGS: 00010002
RAX: ffffffff81a01fd8 RBX: ffff880002207f58 RCX: 00000000c0000101
RDX: 00000000ffff8800 RSI: ffffffffffffffff RDI: ffff880002207f58
RBP: ffff880002207f48 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000001 R14: ffff880002207de8 R15: ffff880002207f58
FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000004b47a5 CR3: 0000000003a25000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a2d020)
Stack:
0000000000000000 0000000000000001 0000000000000000 0000000000000001
<0> ffff880002207cb8 ffffffff814de270 ffff880002207f58 ffff880002207de8
<0> 0000000000000001 0000000000000000 ffff880002207cb8 ffffffff8178de38
Call Trace:
<NMI> [<ffffffff814de270>] nmi+0x20/0x30
[<ffffffff814dacd0>] ? panic+0x120/0x143
<<EOE>>
Code: ff ff 83 3d 48 18 83 00 00 75 28 83 3d 63 18 83 00 00 75 1f 48 c7 c7 f8 79 77 81 31 c0 e8 e2 c2 ff ff e9 2d fe ff ff 0f 0b eb fe <0f> 0b 0f 1f 40 00 eb fa 48 c7 c7 e7 3a 77 81 31 c0 e8 80 c1 ff
RIP [<ffffffff814dea1a>] do_nmi+0x22a/0x2b0
RSP <ffff880002207f28>
---[ end trace 9cc41640835553a3 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper Tainted: G D ---------------- 2.6.32-131.0.15.el6.x86_64 #1
Call Trace:
<NMI> [<ffffffff814dac28>] ? panic+0x78/0x143
[<ffffffff814dec82>] ? oops_end+0xf2/0x100
[<ffffffff8100f2fb>] ? die+0x5b/0x90
[<ffffffff814de544>] ? do_trap+0xc4/0x160
[<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430
[<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20
[<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0
[<ffffffff814de80c>] ? do_nmi+0x1c/0x2b0
[<ffffffff814de270>] ? nmi+0x20/0x30
[<ffffffff814dacd0>] ? panic+0x120/0x143
<-SNIP->

Expected results:
Kexec kernel boots successfully.

Additional info:
System hostname for above testing in following comment.

-pbunyan
Hard lockup happened in idle thread, kernel seems stuck at

static void mwait_idle(void)
{
        if (!need_resched()) {
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
                        clflush((void *)&current_thread_info()->flags);

                __monitor((void *)&current_thread_info()->flags, 0, 0);
                smp_mb();
                if (!need_resched())
                        __sti_mwait(0, 0);
                else
                        local_irq_enable();
        } else
                local_irq_enable();
}

Prarit, any ideas?
I was poking at this box recently. I can reproduce the 6.1 hang without too much effort. However, after updating to the latest 6.2 tools/kernel, kdump recovered from the hang (it took a couple of minutes, but it recovered). Kdump then failed because it could not mount the filesystem: the cciss driver timed out waiting for the controller to become ready after the reset. Snippet below:

Loading i6300esb.ko module
i6300ESB timer: Intel 6300ESB WatchDog Timer Driver v0.04
i6300ESB timer: initialized (0xffffc900000ea000). heartbeat=30 sec (nowayout=0)
Loading shpchp.ko module
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Loading edac_core.ko module
EDAC MC: Ver: 2.1.0 Oct 25 2011
Loading mbcache.ko module
Loading jbd2.ko module
Loading cdrom.ko module
Loading hpsa.ko module
HP HPSA Driver (v 2.0.2-3)
hpsa 0000:02:01.0: unrecognized board ID: 0x40910e11, ignoring.
hpsa 0000:02:01.0: Not resetting device.
Loading cciss.ko module
HP CISS Driver (v 3.6.28-RH1)
cciss 0000:02:01.0: using PCI PM to reset controller
cciss 0000:02:01.0: Refused to change power state, currently in D3
cciss 0000:02:01.0: enabling device (0000 -> 0003)
cciss 0000:02:01.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
cciss 0000:02:01.0: Waiting for board to reset.
cciss 0000:02:01.0: board not ready, timed out.
cciss 0000:02:01.0: failed waiting for board to become ready after hard reset
Loading pata_acpi.ko module
pata_acpi 0000:00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
pata_acpi 0000:00:1f.1: PCI INT A disabled
Loading ata_generic.ko module
Loading ata_piix.ko module
ata_piix 0000:00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
scsi0 : ata_piix
scsi1 : ata_piix
ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x500 irq 14
ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x508 irq 15
ata1.00: ATAPI: HL-DT-STCD-RW/DVD DRIVE GCC-4244N, 2.00, max UDMA/33
ata1.00: configured for UDMA/33
scsi 0:0:0:0: CD-ROM HL-DT-ST RW/DVD GCC-4244N 2.00 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 5
Loading cpufreq_ondemand.ko module
Loading acpi-cpufreq.ko module
Loading iTCO_wdt.ko module
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.05
iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
Loading e752x_edac.ko module
Contact your BIOS vendor to see if the E752x error registers can be safely un-hidden
Loading ext4.ko module
Loading sr_mod.ko module
sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
Waiting for required block device discovery
Creating Block Devices
Creating block device loop0
Creating block device loop1
Creating block device loop2
Creating block device loop3
Creating block device loop4
Creating block device loop5
Creating block device loop6
Creating block device loop7
Creating block device ram0
Creating block device ram1
Creating block device ram10
Creating block device ram11
Creating block device ram12
Creating block device ram13
Creating block device ram14
Creating block device ram15
Creating block device ram2
Creating block device ram3
Creating block device ram4
Creating block device ram5
Creating block device ram6
Creating block device ram7
Creating block device ram8
Creating block device ram9
Creating block device sr0
Making device-mapper control node
Scanning logical volumes
Reading all physical volumes. This may take a while...
No volume groups found
No volume groups found
Activating logical volumes
No volume groups found
No volume groups found
Free memory/Total memory (free %): 206176 / 243020 ( 84.8391 )
Saving to the local filesystem /dev/mapper/vg_hpdl360g401-lv_root
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: No such file or directory while trying to open /dev/mapper/vg_hpdl360g401-lv_root

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>

mount: mounting /dev/mapper/vg_hpdl360g401-lv_root on /mnt failed: No such file or directory
Attempting to enter user-space to capture vmcore
Resetting kernel time value to BIOS time and timezone value to UTC.
Free memory/Total memory (free %): 206176 / 243020 ( 84.8391 )
Creating root device.
Free memory/Total memory (free %): 206236 / 243020 ( 84.8638 )
Checking root filesystem.
fsck (busybox 1.15.1, 2010-11-30 08:10:31 EST)
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: No such file or directory while trying to open /dev/mapper/vg_hpdl360g401-lv_root

The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>

Mounting root filesystem: mount -t ext4 /dev/mapper/vg_hpdl360g401-lv_root /sysroot
unable to mount rootfs. Dropping to shell
/ # /

I should probably re-assign this to someone like Tomas Henzl, who looks after the cciss driver. But I think all the strange panics and hangs on my end have disappeared through various fixes in the kernel.

Cheers,
Don
Which Smart Array is this? I'm guessing from the output in comment 9 it's a P600. If so, I just recently submitted a minor change to delay for 1/2 second in the reset code. That seems to resolve this issue.
(In reply to comment #10)
> Which Smart Array is this? I'm guessing from the output in comment 9 it's a
> P600. If so, I just recently submitted a minor change to delay for 1/2 second
> in the reset code. That seems to resolve this issue.

Hi Mike,

Where can I find that patch to try it?

Cheers,
Don
Created attachment 530676 [details]
Patch to add 500ms delay in PCI PM reset code

Don,
I just attached the patch to the BZ. This one is actually for upstream (can't find the ones I did for RH, arghhhhh). It should apply with an offset. But as you can see it's very simple.

-- mikem
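
The attachment itself is not reproduced in this thread. Purely as an illustration of the idea described (pause for a fixed time after the reset, then poll for readiness), here is a small user-space C sketch. `struct fake_board`, its fields, `fake_sleep_ms()`, `fake_board_ready()` and `reset_and_wait()` are hypothetical names, and time is simulated rather than slept; this is not the cciss driver code and not the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a controller: it only reports "ready" once
 * settle_ms of simulated time has passed since the reset was issued. */
struct fake_board {
    unsigned settle_ms;   /* assumed time the board needs after a reset */
    unsigned elapsed_ms;  /* simulated time since the reset */
};

/* Simulated sleep: advances the fake clock instead of really waiting. */
static void fake_sleep_ms(struct fake_board *b, unsigned ms)
{
    b->elapsed_ms += ms;
}

static bool fake_board_ready(const struct fake_board *b)
{
    return b->elapsed_ms >= b->settle_ms;
}

/* Issue a (simulated) reset, wait post_reset_delay_ms, then poll up to
 * `polls` times, poll_interval_ms apart. Returns 0 if the board became
 * ready within the polling budget, -1 on timeout. */
static int reset_and_wait(struct fake_board *b, unsigned post_reset_delay_ms,
                          unsigned polls, unsigned poll_interval_ms)
{
    assert(b != NULL);
    b->elapsed_ms = 0;                       /* reset issued */
    fake_sleep_ms(b, post_reset_delay_ms);   /* the added fixed delay */
    for (unsigned i = 0; i < polls; i++) {
        if (fake_board_ready(b))
            return 0;
        fake_sleep_ms(b, poll_interval_ms);
    }
    return -1;
}
```

With an assumed board that needs 400 ms to settle and a polling budget of 3 checks at 100 ms intervals, `reset_and_wait(&b, 0, 3, 100)` times out, while `reset_and_wait(&b, 500, 3, 100)` succeeds on the first check. The numbers are made up for the sketch; only the 500 ms figure comes from the attachment title.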
Thanks Mike. That fix worked for me. Cheers, Don
(In reply to comment #13)
> Thanks Mike. That fix worked for me.
>
> Cheers,
> Don

Excellent. Ship it! :)
*** This bug has been marked as a duplicate of bug 746317 ***