Bug 717391 - [RHEL6.1] PANIC booting kexec kernel: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"
Keywords:
Status: CLOSED DUPLICATE of bug 746317
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assignee: Don Zickus
QA Contact: Chao Ye
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-06-28 18:06 UTC by PaulB
Modified: 2011-10-28 16:18 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-10-28 16:18:47 UTC
Target Upstream Version:


Attachments
Patch to add 500ms delay in PCI PM reset code (1.09 KB, patch)
2011-10-28 15:02 UTC, Mike Miller (OS Dev)

Description PaulB 2011-06-28 18:06:25 UTC
Description of problem:
System PANIC'd while booting the kexec kernel: 
"Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0"

Version-Release number of selected component (if applicable):
 2.6.32-131.0.15.el6.x86_64

How reproducible:
Issue reproduced here:
 https://beaker.engineering.redhat.com/jobs/101896
 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/1018/101896/208470//console.log

Steps to Reproduce:
1. Install the system with RHEL6.1 (distro RHEL6.1-20110510.1).
2. Configure the system for kdump.
3. Crash the system as root: echo c > /proc/sysrq-trigger
  
Actual results:
 https://beaker.engineering.redhat.com/jobs/99230
 http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/06/992/99230/202012//console.log
 <-SNIP->
  Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0 
  Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 
  Call Trace: 
  <NMI>  [<ffffffff814dac28>] ? panic+0x78/0x143 
  [<ffffffff810d649d>] ? watchdog_overflow_callback+0xcd/0xd0 
  [<ffffffff81108b56>] ? __perf_event_overflow+0x116/0x290 
  [<ffffffff81109149>] ? perf_event_overflow+0x19/0x20 
  [<ffffffff8101cde4>] ? p4_pmu_handle_irq+0x224/0x2f0 
  [<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430 
  [<ffffffff814df1b8>] ? perf_event_nmi_handler+0x58/0xe0 
  [<ffffffff814e0cf5>] ? notifier_call_chain+0x55/0x80 
  [<ffffffff814e0d5a>] ? atomic_notifier_call_chain+0x1a/0x20 
  [<ffffffff810940fe>] ? notify_die+0x2e/0x30 
  [<ffffffff814de963>] ? do_nmi+0x173/0x2b0 
  [<ffffffff814de270>] ? nmi+0x20/0x30 
  [<ffffffff810141a7>] ? mwait_idle+0x77/0xd0 
  <<EOE>>  [<ffffffff81009e96>] ? cpu_idle+0xb6/0x110 
  [<ffffffff814c376a>] ? rest_init+0x7a/0x80 
  [<ffffffff81bbdf28>] ? start_kernel+0x41d/0x429 
  [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129 
  [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109 
  BUG: scheduling while atomic: swapper/0/0x14010000 
  Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc   
  dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod 
  Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 
  Call Trace: 
  <NMI>  [<ffffffff81055cf6>] ? __schedule_bug+0x66/0x70 
  [<ffffffff814db8c2>] ? thread_return+0x5d9/0x777 
  [<ffffffff814dacd0>] ? panic+0x120/0x143 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff8105faba>] ? __cond_resched+0x2a/0x40 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff814dbbb0>] ? _cond_resched+0x30/0x40 
  [<ffffffff8100dff6>] ? is_valid_bugaddr+0x16/0x40 
  [<ffffffff812624af>] ? report_bug+0x1f/0xc0 
  [<ffffffff8100f31f>] ? die+0x7f/0x90 
  [<ffffffff814de544>] ? do_trap+0xc4/0x160 
  [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430 
  [<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff814de80c>] ? do_nmi+0x1c/0x2b0 
  [<ffffffff814de270>] ? nmi+0x20/0x30 
  [<ffffffff814dacd0>] ? panic+0x120/0x143 
  <<EOE>>  
  ------------[ cut here ]------------ 
  kernel BUG at arch/x86/kernel/traps.c:547! 
  invalid opcode: 0000 [#1] SMP  
  last sysfs file: /sys/kernel/uevent_helper 
  CPU 0  
  Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc     
  dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod 
  Modules linked in: microcode(+) tg3 hpwdt hpilo ipv6 freq_table sunrpc   
  dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod 
  Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant 
  DL360 G4p 
  RIP: 0010:[<ffffffff814dea1a>]  [<ffffffff814dea1a>] do_nmi+0x22a/0x2b0 
  RSP: 0018:ffff880002207f28  EFLAGS: 00010002 
  RAX: ffffffff81a01fd8 RBX: ffff880002207f58 RCX: 00000000c0000101 
  RDX: 00000000ffff8800 RSI: ffffffffffffffff RDI: ffff880002207f58 
  RBP: ffff880002207f48 R08: 0000000000000000 R09: 0000000000000000 
  R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000 
  R13: 0000000000000001 R14: ffff880002207de8 R15: ffff880002207f58 
  FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000 
  CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
  CR2: 00000000004b47a5 CR3: 0000000003a25000 CR4: 00000000000006f0 
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
  Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a2d020) 
  Stack: 
  0000000000000000 0000000000000001 0000000000000000 0000000000000001 
  <0> ffff880002207cb8 ffffffff814de270 ffff880002207f58 ffff880002207de8 
  <0> 0000000000000001 0000000000000000 ffff880002207cb8 ffffffff8178de38 
  Call Trace: 
  <NMI>  
  [<ffffffff814de270>] nmi+0x20/0x30 
  [<ffffffff814dacd0>] ? panic+0x120/0x143 
  <<EOE>>  
  Code: ff ff 83 3d 48 18 83 00 00 75 28 83 3d 63 18 83 00 00 75 1f 48 c7 c7 f8 
  79 77 81 31 c0 e8 e2 c2 ff ff e9 2d fe ff ff 0f 0b eb fe <0f> 0b 0f 1f 40 
  00   eb fa 48 c7 c7 e7 3a 77 81 31 c0 e8 80 c1 ff  
  RIP  [<ffffffff814dea1a>] do_nmi+0x22a/0x2b0 
  RSP <ffff880002207f28> 
  ---[ end trace 9cc41640835553a3 ]--- 
  Kernel panic - not syncing: Fatal exception in interrupt 
  Pid: 0, comm: swapper Tainted: G      D    ----------------     
  2.6.32-131.0.15.el6.x86_64 #1 
  Call Trace: 
  <NMI>  [<ffffffff814dac28>] ? panic+0x78/0x143 
  [<ffffffff814dec82>] ? oops_end+0xf2/0x100 
  [<ffffffff8100f2fb>] ? die+0x5b/0x90 
  [<ffffffff814de544>] ? do_trap+0xc4/0x160 
  [<ffffffff8100ceb5>] ? do_invalid_op+0x95/0xb0 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff814e06c6>] ? kprobe_exceptions_notify+0x16/0x430 
  [<ffffffff8100bf5b>] ? invalid_op+0x1b/0x20 
  [<ffffffff814dea1a>] ? do_nmi+0x22a/0x2b0 
  [<ffffffff814de80c>] ? do_nmi+0x1c/0x2b0 
  [<ffffffff814de270>] ? nmi+0x20/0x30 
  [<ffffffff814dacd0>] ? panic+0x120/0x143 
 <-SNIP->

Expected results:
 The kexec kernel boots successfully.

Additional info:
 The hostname of the system used for the above testing is in the following comment.

-pbunyan

Comment 3 Cong Wang 2011-06-29 09:10:32 UTC
The hard lockup happened in the idle thread; the kernel seems to be stuck at:

static void mwait_idle(void)
{
        if (!need_resched()) {
                trace_power_start(POWER_CSTATE, 1, smp_processor_id());
                if (cpu_has(&current_cpu_data, X86_FEATURE_CLFLUSH_MONITOR))
                        clflush((void *)&current_thread_info()->flags);

                __monitor((void *)&current_thread_info()->flags, 0, 0);
                smp_mb();    
                if (!need_resched())
                        __sti_mwait(0, 0);
                else
                        local_irq_enable();
        } else  
                local_irq_enable();
}

Prarit, any ideas?
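
For context on why the trace above ends in watchdog_overflow_callback: as far as I understand it, the hard lockup detector fires from a perf NMI and checks whether the per-CPU timer interrupt has made progress since the previous NMI. The sketch below is only a userspace model of that comparison. The struct and function names are made up for illustration and are not the kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; these names are illustrative only. */
struct lockup_detector {
    uint64_t timer_ticks;        /* bumped by the periodic timer interrupt */
    uint64_t ticks_at_last_nmi;  /* snapshot taken by the previous NMI check */
};

/* Simulated timer interrupt: it can only run while interrupts are serviced. */
void timer_tick(struct lockup_detector *d)
{
    d->timer_ticks++;
}

/*
 * Simulated NMI check. Returns true when no timer tick has been seen
 * since the previous check, which is what gets reported as a hard LOCKUP.
 */
bool nmi_check_hard_lockup(struct lockup_detector *d)
{
    if (d->timer_ticks == d->ticks_at_last_nmi)
        return true;                      /* no progress between two NMIs */
    d->ticks_at_last_nmi = d->timer_ticks;
    return false;
}
```

In this model a CPU sitting in mwait with timer interrupts never arriving would look exactly like a CPU spinning with interrupts disabled, so the idle thread showing up in the trace does not by itself mean mwait_idle() is at fault.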

Comment 9 Don Zickus 2011-10-27 20:57:32 UTC
I was poking at this box recently.  I can reproduce the 6.1 hang without too much effort.  However, after updating to the latest 6.2 tools/kernel, kdump recovered from the hang (it took a couple of minutes, but it recovered).

Kdump still failed, though: it could not mount the root filesystem because the cciss driver timed out waiting for the controller to become ready after the reset.

Snippet below:

Loading i6300esb.ko module
i6300ESB timer: Intel 6300ESB WatchDog Timer Driver v0.04
i6300ESB timer: initialized (0xffffc900000ea000). heartbeat=30 sec
(nowayout=0)
Loading shpchp.ko module
shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
Loading edac_core.ko module
EDAC MC: Ver: 2.1.0 Oct 25 2011
Loading mbcache.ko module
Loading jbd2.ko module
Loading cdrom.ko module
Loading hpsa.ko module
HP HPSA Driver (v 2.0.2-3)
hpsa 0000:02:01.0: unrecognized board ID: 0x40910e11, ignoring.
hpsa 0000:02:01.0: Not resetting device.
Loading cciss.ko module
HP CISS Driver (v 3.6.28-RH1)
cciss 0000:02:01.0: using PCI PM to reset controller
cciss 0000:02:01.0: Refused to change power state, currently in D3
cciss 0000:02:01.0: enabling device (0000 -> 0003)
cciss 0000:02:01.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
cciss 0000:02:01.0: Waiting for board to reset.




cciss 0000:02:01.0: board not ready, timed out.
cciss 0000:02:01.0: failed waiting for board to become ready after hard
reset
Loading pata_acpi.ko module
pata_acpi 0000:00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
pata_acpi 0000:00:1f.1: PCI INT A disabled
Loading ata_generic.ko module
Loading ata_piix.ko module
ata_piix 0000:00:1f.1: PCI INT A -> GSI 18 (level, low) -> IRQ 18
scsi0 : ata_piix
scsi1 : ata_piix
ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x500 irq 14
ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x508 irq 15
ata1.00: ATAPI: HL-DT-STCD-RW/DVD DRIVE GCC-4244N, 2.00, max UDMA/33
ata1.00: configured for UDMA/33
scsi 0:0:0:0: CD-ROM            HL-DT-ST RW/DVD GCC-4244N 2.00 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 5
Loading cpufreq_ondemand.ko module
Loading acpi-cpufreq.ko module
Loading iTCO_wdt.ko module
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.05
iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
Loading e752x_edac.ko module
Contact your BIOS vendor to see if the E752x error registers can be safely
un-hidden
Loading ext4.ko module
Loading sr_mod.ko module
sr0: scsi3-mmc drive: 24x/24x writer cd/rw xa/form2 cdda tray
Uniform CD-ROM driver Revision: 3.20
Waiting for required block device discovery
Creating Block Devices
Creating block device loop0
Creating block device loop1
Creating block device loop2
Creating block device loop3
Creating block device loop4
Creating block device loop5
Creating block device loop6
Creating block device loop7
Creating block device ram0
Creating block device ram1
Creating block device ram10
Creating block device ram11
Creating block device ram12
Creating block device ram13
Creating block device ram14
Creating block device ram15
Creating block device ram2
Creating block device ram3
Creating block device ram4
Creating block device ram5
Creating block device ram6
Creating block device ram7
Creating block device ram8
Creating block device ram9
Creating block device sr0
Making device-mapper control node
Scanning logical volumes
  Reading all physical volumes.  This may take a while...
  No volume groups found
  No volume groups found
Activating logical volumes
  No volume groups found
  No volume groups found
Free memory/Total memory (free %): 206176 / 243020 ( 84.8391 )
Saving to the local filesystem /dev/mapper/vg_hpdl360g401-lv_root
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: No such file or directory while trying to open
/dev/mapper/vg_hpdl360g401-lv_root

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

mount: mounting /dev/mapper/vg_hpdl360g401-lv_root on /mnt failed: No such
file or directory
Attempting to enter user-space to capture vmcore
Resetting kernel time value to BIOS time and timezone value to UTC.
Free memory/Total memory (free %): 206176 / 243020 ( 84.8391 )
Creating root device.
Free memory/Total memory (free %): 206236 / 243020 ( 84.8638 )
Checking root filesystem.
fsck (busybox 1.15.1, 2010-11-30 08:10:31 EST)
e2fsck 1.41.12 (17-May-2010)
fsck.ext4: No such file or directory while trying to open
/dev/mapper/vg_hpdl360g401-lv_root

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

Mounting root filesystem: mount -t ext4 /dev/mapper/vg_hpdl360g401-lv_root
/sysroot
unable to mount rootfs. Dropping to shell
/ #
/


I should probably re-assign this to someone like Tomas Henzl who looks after the cciss driver.

But I think all the strange panics and hangs on my end have disappeared through various fixes in the kernel.

Cheers,
Don

Comment 10 Mike Miller (OS Dev) 2011-10-27 21:10:16 UTC
Which Smart Array is this? I'm guessing from the output in comment 9 it's a P600. If so, I just recently submitted a minor change to delay for 1/2 second in the reset code. That seems to resolve this issue.
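
To make the idea concrete, here is a rough userspace sketch of "reset, wait a fixed settle time, then poll for ready with a timeout". It is not the attached patch and not cciss code. The reset_ops structure, the fake_board device, and all of the function names are hypothetical; the delay is injected as a callback so the sketch does not depend on any particular sleep call.

```c
#include <stdbool.h>

/* Hypothetical hooks; a real driver would talk to PCI config space instead. */
struct reset_ops {
    void (*power_cycle)(void *ctx);            /* e.g. power state down, then up */
    bool (*is_ready)(void *ctx);               /* device reports ready */
    void (*delay_ms)(void *ctx, unsigned ms);  /* wait for ms milliseconds */
};

/* Returns 0 once the device is ready, -1 if it never becomes ready. */
int reset_and_wait(const struct reset_ops *ops, void *ctx,
                   unsigned settle_ms, unsigned poll_ms, unsigned max_polls)
{
    ops->power_cycle(ctx);
    ops->delay_ms(ctx, settle_ms);  /* settle delay before touching the device */
    for (unsigned i = 0; i < max_polls; i++) {
        if (ops->is_ready(ctx))
            return 0;
        ops->delay_ms(ctx, poll_ms);
    }
    return -1;
}

/* Fake board for illustration: ready once enough simulated time has passed. */
struct fake_board {
    unsigned elapsed_ms;
    unsigned ready_after_ms;
};

void fake_power_cycle(void *ctx)
{
    ((struct fake_board *)ctx)->elapsed_ms = 0;
}

bool fake_is_ready(void *ctx)
{
    const struct fake_board *b = ctx;
    return b->elapsed_ms >= b->ready_after_ms;
}

void fake_delay(void *ctx, unsigned ms)
{
    ((struct fake_board *)ctx)->elapsed_ms += ms;  /* advance simulated clock */
}
```

With settle_ms set to 500, a fake board that needs 700 ms comes up within a few polls, while one that needs far longer than the polling budget returns -1, which corresponds to the "board not ready, timed out" case in comment 9.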

Comment 11 Don Zickus 2011-10-28 14:33:54 UTC
(In reply to comment #10)
> Which Smart Array is this? I'm guessing from the output in comment 9 it's a
> P600. If so, I just recently submitted a minor change to delay for 1/2 second
> in the reset code. That seems to resolve this issue.

Hi Mike,

Where can I find that patch to try it?

Cheers,
Don

Comment 12 Mike Miller (OS Dev) 2011-10-28 15:02:50 UTC
Created attachment 530676 [details]
Patch to add 500ms delay in PCI PM reset code

Don,
I just attached the patch to the BZ. This one is actually for upstream (can't find the ones I did for RH, arghhhhh). It should apply with an offset. But as you can see it's very simple.

-- mikem

Comment 13 Don Zickus 2011-10-28 15:44:17 UTC
Thanks Mike.  That fix worked for me.

Cheers,
Don

Comment 14 Mike Miller (OS Dev) 2011-10-28 15:45:43 UTC
(In reply to comment #13)
> Thanks Mike.  That fix worked for me.
> 
> Cheers,
> Don

Excellent. Ship it! :)

Comment 16 Don Zickus 2011-10-28 16:18:47 UTC

*** This bug has been marked as a duplicate of bug 746317 ***

