Bug 690044

Summary: RHEL4.9 guest NMI watchdog detects LOCKUP during boot on RHEL6.1 AMD host
Product: Red Hat Enterprise Linux 6 Reporter: Qingtang Zhou <qzhou>
Component: qemu-kvm Assignee: Gleb Natapov <gleb>
Status: CLOSED WONTFIX QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 6.1 CC: knoel, michen, mkenneth, tburke, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
nmi_watchdog is not working for rhel4 guest kernels
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-03 18:09:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description                 Flags
guest dmesg                 none
guest dmesg (virtio block)  none

Description Qingtang Zhou 2011-03-23 06:08:31 UTC
Created attachment 486962 [details]
guest dmesg

Description of problem:
The RHEL4.9 guest kernel panics during boot about 10% of the time and hangs during boot about 80% of the time.

Version-Release number of selected component (if applicable):
host:
# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.150.el6.x86_64
# rpm -q kernel
2.6.32-120.el6.x86_64


How reproducible:
10% kernel panic
80% hang

Steps to Reproduce:
1. Start the guest with this command:
qemu -name 'vm1' \
-chardev socket,id=human_monitor_MTkB,path=/tmp/monitor-humanmonitor1-20110319-152410-zRkw,server,nowait \
-mon chardev=human_monitor_MTkB,mode=readline \
-chardev socket,id=serial_Yxc0,path=/tmp/serial-20110319-152410-zRkw,server,nowait \
-device isa-serial,chardev=serial_Yxc0 \
-drive file='RHEL-4.9-64-virtio.raw',index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,format=raw,aio=native \
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 \
-device virtio-net-pci,netdev=idOpH6lL,mac=9a:f6:24:35:ce:19,id=ndev00idOpH6lL,bus=pci.0,addr=0x3 \
-netdev tap,id=idOpH6lL,vhost=on,ifname='t0-152410-zRkw',script='qemu-ifup-switch',downscript='no' \
-m 4096 -smp 2,cores=1,threads=1,sockets=2 \
-cpu cpu64-rhel6,vendor="RED HAT PROD",+sse2,+x2apic \
-spice port=8000,disable-ticketing -vga qxl -rtc base=utc,clock=host,driftfix=none \
-M rhel6.1.0 -boot order=cdn,once=c,menu=off -usbdevice tablet -no-kvm-pit-reinjection -enable-kvm

2. The guest hangs (or the kernel panics) during boot.

  
Actual results:
The guest hangs (or the kernel panics) during boot.

Expected results:
The guest boots and runs normally, with no hang and no panic.

Additional info:
guest call trace:

NMI Watchdog detected LOCKUP, CPU=0, registers:
CPU 0
Modules linked in: parport_pc lp parport autofs4 sunrpc ds yenta_socket pcmcia_core cpufreq_powersave loop joydev button battery ac uhci_hcd floppy virtio_blk virtio_net virtio_pci virtio_ring virtio dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod
Pid: 468, comm: kjournald Not tainted 2.6.9-100.ELsmp
RIP: 0010:[<ffffffff80110919>] <ffffffff80110919>{iret_label+0}
RSP: 0000:ffffffff804732d8  EFLAGS: 00000086
RAX: 0000000000000001 RBX: ffffffff804e5de0 RCX: 00000000000001f6
RDX: 000000000000c000 RSI: 000000000000c000 RDI: 0000000000000001
RBP: ffffffff804e5f28 R08: 0000000000004e1f R09: 0000000000000000
R10: 0000010037dbc000 R11: ffffffff8027d272 R12: ffffffff804e5de0
R13: ffffffff804e5de0 R14: 0000000000000008 R15: 00000101190c5db0
FS:  0000002a95aad6e0(0000) GS:ffffffff80506900(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a958860ec CR3: 0000000000101000 CR4: 00000000000006e0
Process kjournald (pid: 468, threadinfo 000001011be3c000, task 0000010037dd9030)
Stack: ffffffff80110919 0000000000000010 0000000000000086 ffffffff804732d8
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000
Call Trace:<ffffffff80110919>{iret_label+0}  <EOE> <ffffffff80110919>{iret_label+0}


Code: 48 cf 0f ba e2 03 73 1b fb 57 e8 62 43 20 00 5f 65 48 8b 0c
Kernel panic - not syncing: nmi watchdog
 Badness in panic at kernel/panic.c:121

Call Trace:<ffffffff8013881d>{panic+558} <ffffffff801118dc>{show_stack+241}
<ffffffff80111a06>{show_registers+277} <ffffffff80111d0d>{die_nmi+130}
<ffffffff8011de1b>{nmi_watchdog_tick+276} <ffffffff801125de>{default_do_nmi+116}
<ffffffff8011df05>{do_nmi+115} <ffffffff801111eb>{paranoid_exit+0}
<ffffffff8027d272>{__ide_dma_begin+0} <ffffffff80110919>{iret_label+0}
 <EOE> <ffffffff80110919>{iret_label+0}
Badness in i8042_panic_blink at drivers/input/serio/i8042.c:987

Call Trace:<ffffffff80249cf5>{i8042_panic_blink+239} <ffffffff801387cb>{panic+476}
<ffffffff801118dc>{show_stack+241} <ffffffff80111a06>{show_registers+277}
<ffffffff80111d0d>{die_nmi+130} <ffffffff8011de1b>{nmi_watchdog_tick+276}
<ffffffff801125de>{default_do_nmi+116} <ffffffff8011df05>{do_nmi+115}
<ffffffff801111eb>{paranoid_exit+0} <ffffffff8027d272>{__ide_dma_begin+0}
<ffffffff80110919>{iret_label+0}  <EOE> <ffffffff80110919>{iret_label+0}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:990

Call Trace:<ffffffff80249d87>{i8042_panic_blink+385} <ffffffff801387cb>{panic+476}
<ffffffff801118dc>{show_stack+241} <ffffffff80111a06>{show_registers+277}
<ffffffff80111d0d>{die_nmi+130} <ffffffff8011de1b>{nmi_watchdog_tick+276}
<ffffffff801125de>{default_do_nmi+116} <ffffffff8011df05>{do_nmi+115}
<ffffffff801111eb>{paranoid_exit+0} <ffffffff8027d272>{__ide_dma_begin+0}
<ffffffff80110919>{iret_label+0}  <EOE> <ffffffff80110919>{iret_label+0}

Badness in i8042_panic_blink at drivers/input/serio/i8042.c:992

Call Trace:<ffffffff80249dec>{i8042_panic_blink+486} <ffffffff801387cb>{panic+476}
<ffffffff801118dc>{show_stack+241} <ffffffff80111a06>{show_registers+277}
<ffffffff80111d0d>{die_nmi+130} <ffffffff8011de1b>{nmi_watchdog_tick+276}
<ffffffff801125de>{default_do_nmi+116} <ffffffff8011df05>{do_nmi+115}
<ffffffff801111eb>{paranoid_exit+0} <ffffffff8027d272>{__ide_dma_begin+0}
<ffffffff80110919>{iret_label+0}  <EOE> <ffffffff80110919>{iret_label+0}

Comment 2 Dor Laor 2011-03-23 14:14:54 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
nmi_watchdog is not working for rhel4 guest kernels

Comment 3 Gleb Natapov 2011-03-23 16:29:52 UTC
(In reply to comment #2)
>     Technical note added. If any revisions are required, please edit the
> "Technical Notes" field
>     accordingly. All revisions will be proofread by the Engineering Content
> Services team.
> 
>     New Contents:
> nmi_watchdog is not working for rhel4 guest kernels

I do not see how nmi_watchdog is to blame here.

Comment 4 Gleb Natapov 2011-03-23 16:31:05 UTC
First of all, cache=writethrough should not be used. Can you reproduce with cache=none? Does it hang with virtio block too?

Comment 5 Qingtang Zhou 2011-03-24 05:05:24 UTC
(In reply to comment #4)
> First of all, cache=writethrough should not be used. Can you reproduce with
> cache=none? Does it hang with virtio block too?
Hi Gleb,
I have tried both IDE and virtio_blk drives with "cache=none"; the guest still hangs during boot.


By the way, do you mean that "cache=writethrough" should not be used for any RHEL4.9 guest with a raw format image, or that we should use "cache=none" for all guests with raw images?

Comment 6 Gleb Natapov 2011-03-24 05:10:37 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > First of all, cache=writethrough should not be used. Can you reproduce with
> > cache=none? Does it hang with virtio block too?
> Hi Gleb,
> I have tried both IDE and virtio_blk drives with "cache=none"; the guest still
> hangs during boot.
Can you post dmesg of the hang with virtio_blk please?

> 
> 
> By the way, do you mean that "cache=writethrough" should not be used for any
> RHEL4.9 guest with a raw format image, or that we should use "cache=none" for
> all guests with raw images?

IIRC we should always use cache=none regardless of disk format.
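
As an illustration of the cache=none suggestion, here is a minimal sketch of the -drive argument from the original command line with only the cache mode changed. The image name and drive id are copied from the reproduction steps; the IMAGE, CACHE_MODE and DRIVE_ARG variables are hypothetical names used for this example only.

```shell
# Values copied from the reproduction command in the description.
IMAGE='RHEL-4.9-64-virtio.raw'

# Illustrative variable (not a qemu-kvm option name): the cache mode to test.
# The original report used "writethrough".
CACHE_MODE='none'

# Same -drive argument as in the report, with only cache= changed.
DRIVE_ARG="file=${IMAGE},index=0,if=none,id=drive-ide0-0-0,media=disk,cache=${CACHE_MODE},format=raw,aio=native"

# Print the option so it can be pasted into the qemu command line.
echo "-drive ${DRIVE_ARG}"
```

Keeping the cache mode in one variable makes it easy to rerun the same guest with writethrough and none and compare the hang rate.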

Comment 7 Qingtang Zhou 2011-03-24 07:20:44 UTC
Created attachment 487237 [details]
guest dmesg (virtio block)

Comment 8 Gleb Natapov 2011-03-24 07:40:37 UTC
(In reply to comment #7)
> Created attachment 487237 [details]
> guest dmesg (virtio block)

I do not see any lockup message there. Can you post one with a lockup message? BTW, what do "info status" and "info cpus" show in the monitor when it hangs?

Comment 9 Qingtang Zhou 2011-03-24 08:02:48 UTC
(In reply to comment #8)
> 
> I do not see any lockup message there. Can you post one with a lockup message?
> BTW, what do "info status" and "info cpus" show in the monitor when it hangs?

Yes, it didn't lock up this time. I tried about 20 times today: no lockup occurred, but the guest hung about 80% of the time. I'll keep trying to reproduce the lockup.

monitor output:
(qemu) info status
info status
VM status: running
(qemu) info cpus
info cpus
* CPU #0: pc=0xffffffff80110919 thread_id=5052 
  CPU #1: pc=0xffffffff8011cd16 thread_id=5060
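
Note that the CPU #0 pc above is the same address as the RIP in the NMI watchdog report from the description (0xffffffff80110919, iret_label+0). A small sketch for making that comparison on saved monitor output is below. parse_info_cpus is a hypothetical helper written for this example, not a QEMU tool, and the sample text is the output pasted above.

```shell
# Print "<cpu index> <pc>[ (halted)]" for each CPU line of `info cpus`
# output read from stdin. Hypothetical helper, not part of QEMU.
parse_info_cpus() {
    sed -n 's/^[* ]*CPU #\([0-9][0-9]*\): pc=\(0x[0-9a-f]*\)\( (halted)\)\{0,1\}.*/\1 \2\3/p'
}

# RIP reported by the NMI watchdog in the description ({iret_label+0}).
LOCKUP_RIP=0xffffffff80110919

# Sample taken from the monitor output in this comment.
SAMPLE='* CPU #0: pc=0xffffffff80110919 thread_id=5052
  CPU #1: pc=0xffffffff8011cd16 thread_id=5060'

# Flag any CPU whose pc equals the RIP from the lockup report.
printf '%s\n' "$SAMPLE" | parse_info_cpus | while read -r cpu pc state; do
    if [ "$pc" = "$LOCKUP_RIP" ]; then
        echo "CPU $cpu is at the lockup RIP $pc"
    else
        echo "CPU $cpu is at $pc${state:+ $state}"
    fi
done
```

Running the same helper on several samples taken a few seconds apart would show whether a CPU is stuck at one address or still making progress.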

Comment 10 Qingtang Zhou 2011-03-24 09:15:10 UTC
I continued rebooting the guest and found CPU #1 halted:

(qemu) info status
info status
VM status: running
(qemu) info cpus
info cpus
* CPU #0: pc=0xffffffff80110919 thread_id=5170 
  CPU #1: pc=0xffffffff8010e7a9 (halted) thread_id=5171 
(qemu)


Still no NMI lockup has occurred...

Comment 11 Qingtang Zhou 2011-03-24 10:10:08 UTC
I attached gdb to the CPU #1 thread:

(gdb) bt
#0  0x00000036122dde87 in ioctl () from /lib64/libc.so.6
#1  0x000000000042d03f in kvm_run (env=0x2c30010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:927
#2  0x000000000042d4c9 in kvm_cpu_exec (env=<value optimized out>) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1663
#3  0x000000000042e20f in kvm_main_loop_cpu (_env=0x2c30010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1931
#4  ap_main_loop (_env=0x2c30010) at /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1981
#5  0x00000036126077e1 in start_thread () from /lib64/libpthread.so.0
#6  0x00000036122e5dcd in clone () from /lib64/libc.so.6

Comment 12 Gleb Natapov 2011-03-24 10:16:12 UTC
(In reply to comment #11)
> I attached gdb to the CPU #1 thread:
> 
> (gdb) bt
> #0  0x00000036122dde87 in ioctl () from /lib64/libc.so.6
> #1  0x000000000042d03f in kvm_run (env=0x2c30010) at
> /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:927
> #2  0x000000000042d4c9 in kvm_cpu_exec (env=<value optimized out>) at
> /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1663
> #3  0x000000000042e20f in kvm_main_loop_cpu (_env=0x2c30010) at
> /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1931
> #4  ap_main_loop (_env=0x2c30010) at
> /usr/src/debug/qemu-kvm-0.12.1.2/qemu-kvm.c:1981
> #5  0x00000036126077e1 in start_thread () from /lib64/libpthread.so.0
> #6  0x00000036122e5dcd in clone () from /lib64/libc.so.6

That is useless. Better do ftrace.
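
A rough sketch of what an ftrace capture of KVM events on the host could look like is below. It assumes debugfs is mounted at /sys/kernel/debug and that the host kernel exposes the kvm trace event group; the helper names, the TRACING variable and the output path in the usage line are made up for illustration.

```shell
# Location of the ftrace control files; assumes debugfs is mounted here.
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# Enable all kvm trace events and start tracing (needs root on the host).
kvm_trace_start() {
    echo 1 > "$TRACING/events/kvm/enable" &&
    echo 1 > "$TRACING/tracing_on"
}

# Stop tracing, save the trace buffer to the file given as $1,
# then disable the kvm events again.
kvm_trace_stop() {
    echo 0 > "$TRACING/tracing_on" &&
    cat "$TRACING/trace" > "$1" &&
    echo 0 > "$TRACING/events/kvm/enable"
}

# Usage (as root on the host):
#   kvm_trace_start
#   ... boot the guest and wait for the hang ...
#   kvm_trace_stop /tmp/kvm-hang.trace
```

Unlike the gdb backtrace, which only shows the vcpu thread sitting in the KVM_RUN ioctl, the saved trace would show what the vcpus were doing in the kernel around the time of the hang.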

Comment 14 Gleb Natapov 2011-06-03 18:09:37 UTC
nmi watchdog in a guest is not supported. Closing.