Bug 689676

Summary: [AMD] RHEL5.6 SMP guest VM hang or kernel panic in bootup after setting nmi_watchdog=1
Product: Red Hat Enterprise Linux 6 Reporter: yacui
Component: qemu-kvm Assignee: Gleb Natapov <gleb>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1 CC: knoel, mkenneth, tburke, virt-maint, ypu
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-06-03 18:07:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 580954    
Attachments:
Description Flags
serial log for guest file-system broken none
screen dump for guest file system broken none
serial log for guest kernel panic none
screen dump for guest kernel panic none

Description yacui 2011-03-22 05:22:49 UTC
Description of problem:
On a RHEL6.1 host, when a RHEL5.6 guest VM is booted with more than one vCPU and nmi_watchdog=1, the guest may hang or hit a kernel panic during bootup.

Version-Release number of selected component (if applicable):
Host Kernel:
2.6.32-121.el6.x86_64
Guest Kernel:
2.6.18-247.el5
KVM Version:
qemu-kvm-debuginfo-0.12.1.2-2.150.el6.x86_64
qemu-kvm-tools-0.12.1.2-2.150.el6.x86_64
qemu-kvm-0.12.1.2-2.150.el6.x86_64

How reproducible:
guest hang in bootup - 7 out of 300 times
guest kernel panic in bootup - 1 out of 300 times
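
For scale, the counts above can be turned into failure rates. This is only a rough sketch: it assumes each boot is an independent trial with a constant failure probability, which this report does not verify, and the function names are illustrative rather than part of any test harness.

```python
import math

def failure_rate(failures, runs):
    """Observed fraction of runs that failed."""
    return failures / runs

def boots_for_confidence(rate, confidence=0.95):
    """Boots needed to see at least one failure with the given probability,
    ASSUMING independent boots with a constant failure rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))

hang_rate = failure_rate(7, 300)    # guest hang in bootup
panic_rate = failure_rate(1, 300)   # guest kernel panic in bootup

print(f"hang rate:  {hang_rate:.2%}")
print(f"panic rate: {panic_rate:.2%}")
print("boots for a 95% chance of one hang: ", boots_for_confidence(hang_rate))
print("boots for a 95% chance of one panic:", boots_for_confidence(panic_rate))
```

With rates this low, a handful of clean boots says little about whether a change fixed the problem.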

Steps to Reproduce:
1. Boot a normal RHEL5.6 guest.
2. Add nmi_watchdog=1 to the kernel line.
3. Reboot the RHEL5.6 guest.
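
Step 2 can be scripted along these lines. A minimal sketch, assuming the guest uses a GRUB legacy configuration whose boot entries contain lines starting with "kernel"; the report does not say how the parameter was actually added. The function name and the sample root= value are hypothetical.

```python
def add_kernel_param(grub_conf_text, param="nmi_watchdog=1"):
    """Append `param` to every 'kernel' line that does not already carry it."""
    out = []
    for line in grub_conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "kernel" and param not in fields:
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"

# Hypothetical grub.conf fragment; the root= value is made up for illustration.
sample = (
    "title Red Hat Enterprise Linux Server (2.6.18-247.el5)\n"
    "\troot (hd0,0)\n"
    "\tkernel /vmlinuz-2.6.18-247.el5 ro root=/dev/VolGroup00/LogVol00\n"
    "\tinitrd /initrd-2.6.18-247.el5.img\n"
)

result = add_kernel_param(sample)
print(result, end="")
```

The function is idempotent, so running it again on an already edited file leaves the kernel line unchanged.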
  
Actual results:
After adding nmi_watchdog=1, the guest sometimes hangs during bootup.

Expected results:
The guest should boot up normally.

Additional info:
1 CommandLine:
qemu-kvm -name 'vm1' -chardev socket,id=human_monitor_kMoF,path=/tmp/monitor-humanmonitor1-20110315-134747-luGU,server,nowait -mon chardev=human_monitor_kMoF,mode=readline -chardev socket,id=serial_T9FN,path=/tmp/serial-20110315-134747-luGU,server,nowait -device isa-serial,chardev=serial_T9FN -drive file='/home/kvm-qe/autotest/client/tests/kvm/images/RHEL-Server-5.6-64.raw',index=0,if=none,id=drive-ide0-0-0,media=disk,cache=writethrough,snapshot=on,format=raw,aio=native -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -device e1000,netdev=idM8ohr1,mac=9a:54:b1:a9:34:c7,netdev=idM8ohr1,id=ndev00idM8ohr1,bus=pci.0,addr=0x3 -netdev tap,id=idM8ohr1,ifname='t0-134747-luGU',script='/home/kvm-qe/autotest/client/tests/kvm/scripts/qemu-ifup-switch',downscript='no' -m 8192 -smp 4,cores=1,threads=1,sockets=4 -cpu cpu64-rhel6,+sse2,+x2apic -vnc :0 -rtc base=utc,clock=host,driftfix=none  -boot order=cdn,once=c,menu=off   -usbdevice tablet -no-kvm-pit-reinjection -enable-kvm

2 Host CPU Info:
processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model  : 67
model name : Dual-Core AMD Opteron(tm) Processor 1216
stepping : 3
cpu MHz  : 1000.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id  : 1
cpu cores : 2
apicid  : 1
initial apicid : 1
fpu  : yes
fpu_exception : yes
cpuid level : 1
wp  : yes
flags  : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good extd_apicid pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy
bogomips : 2009.10
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc

3 Serial Output:
3.1 Guest hang during the startup process
Last few lines of serial output in this scenario:
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: UHCI Host Controller
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: new USB bus registered, assigned bus number 1
2011-03-16 22:25:56: uhci_hcd 0000:00:01.2: irq 11, io base 0x0000c020
2011-03-16 22:25:56: usb usb1: configuration #1 chosen from 1 choice
2011-03-16 22:25:56: hub 1-0:1.0: USB hub found
2011-03-16 22:25:56: hub 1-0:1.0: 2 ports detected
2011-03-16 22:25:56: input: ImExPS/2 Generic Explorer Mouse as /class/input/input1
2011-03-16 22:25:56: usb 1-1: new full speed USB device using uhci_hcd and address 2
2011-03-16 22:25:56: SCSI subsystem initialized
2011-03-16 22:25:56: usb 1-1: configuration #1 chosen from 1 choice
2011-03-16 22:25:56: input: QEMU 0.12.1 QEMU USB Tablet as /class/input/input2
2011-03-16 22:25:56: input: USB HID v0.01 Pointer [QEMU 0.12.1 QEMU USB Tablet] on usb-0000:00:01.2-1
2011-03-16 22:25:56: device-mapper: uevent: version 1.0.3
2011-03-16 22:25:56: device-mapper: ioctl: 4.11.5-ioctl (2007-12-12) initialised: dm-devel
2011-03-16 22:25:57: device-mapper: dm-raid45: initialized v0.2594l
2011-03-16 22:26:18: kjournald starting.  Commit interval 5 seconds
2011-03-16 22:26:18: EXT3-fs: mounted filesystem with ordered data mode.
2011-03-16 22:26:18: type=1404 audit(1300285577.849:2): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295
2011-03-16 22:26:18: type=1403 audit(1300285578.112:3): policy loaded auid=4294967295 ses=4294967295

3.2 Guest Kernel Panic
2011-03-22 04:46:02: TCP bic registered
2011-03-22 04:46:02: Initializing IPsec netlink socket
2011-03-22 04:46:02: input: AT Translated Set 2 keyboard as /class/input/input0
2011-03-22 04:46:02: NET: Registered protocol family 1
2011-03-22 04:46:02: NET: Registered protocol family 17
2011-03-22 04:46:02: ACPI: (supports S3 S4 S5)
2011-03-22 04:46:02: Initalizing network drop monitor service
2011-03-22 04:46:02: Freeing unused kernel memory: 224k freed
2011-03-22 04:46:02: Write protecting the kernel read-only data: 520k
2011-03-22 04:46:02: input: ImExPS/2 Generic Explorer Mouse as /class/input/input1
2011-03-22 04:46:12: Kernel panic - not syncing: Attempted to kill init!
2011-03-22 04:46:12:
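
Serial logs like the two excerpts above can be triaged mechanically. The following is a sketch only: it assumes every record starts with a "YYYY-MM-DD HH:MM:SS:" prefix as shown in this report, and the function name is hypothetical. It reports the first kernel panic message, if any, and the longest silence between consecutive timestamped lines.

```python
from datetime import datetime

def triage(lines):
    """Return (panic_message_or_None, largest_gap_seconds) for a serial log
    whose records are ASSUMED to start with 'YYYY-MM-DD HH:MM:SS:'."""
    panic = None
    largest_gap = 0.0
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # blank or wrapped line without a timestamp prefix
        message = line[20:].strip()
        if previous is not None:
            largest_gap = max(largest_gap, (stamp - previous).total_seconds())
        previous = stamp
        if panic is None and message.startswith("Kernel panic"):
            panic = message
    return panic, largest_gap

# Lines copied from section 3.2 of this report.
log = [
    "2011-03-22 04:46:02: Write protecting the kernel read-only data: 520k",
    "2011-03-22 04:46:02: input: ImExPS/2 Generic Explorer Mouse as /class/input/input1",
    "2011-03-22 04:46:12: Kernel panic - not syncing: Attempted to kill init!",
]
panic, gap = triage(log)
print(panic)
print(gap)
```

A long gap alone does not prove a hang, since the timestamps only record when each line reached the log.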

Comment 2 yacui 2011-03-23 08:14:36 UTC
After more autotest runs, I found that the first symptom, "guest hang during the startup process" (7 out of 300 runs), is actually a broken guest file system, which can be repaired manually with fsck.

So far the guest kernel panic has occurred once; the serial output is in section 3.2 of comment 1.

Comment 3 Avi Kivity 2011-03-23 12:13:59 UTC
Is it a guest file system issue?  Or a host file system issue?  What's the cause?

Comment 4 yacui 2011-03-24 05:48:55 UTC
It's a guest file system issue. I don't know why the guest file system gets damaged. The whole procedure is: boot a normal guest, set nmi_watchdog, and reboot the guest. The guest file system is sometimes found broken during that bootup.

I can also upload the full serial logs and screen dumps as attachments for reference (for both the broken file system case and the kernel panic case).

Comment 5 yacui 2011-03-24 05:50:44 UTC
Created attachment 487209 [details]
serial log for guest file-system broken

Comment 6 yacui 2011-03-24 05:52:04 UTC
Created attachment 487211 [details]
screen dump for guest file system broken

Comment 7 yacui 2011-03-24 05:53:14 UTC
Created attachment 487213 [details]
serial log for guest kernel panic

Comment 8 yacui 2011-03-24 05:54:30 UTC
Created attachment 487214 [details]
screen dump for guest kernel panic

Comment 9 Gleb Natapov 2011-03-24 09:53:57 UTC
(In reply to comment #8)
> Created attachment 487214 [details]
> screen dump for guest kernel panic

This one is also due to guest file system corruption. The panic happens because the file system can't be mounted.

Comment 10 Avi Kivity 2011-03-24 14:18:31 UTC
Don't see the breakage in the logs.

Comment 12 Gleb Natapov 2011-06-03 18:07:04 UTC
nmi_watchdog=1 is not supported. In addition, it turned out that the hang is due to guest fs corruption and not the NMI watchdog. Closing.