Bug 1024257

Summary: [SVVP]Host crashed while multiple (4) guests running SVVP test on it
Product: Red Hat Enterprise Linux 6 Reporter: Min Deng <mdeng>
Component: qemu-kvm Assignee: Virtualization Maintenance <virt-maint>
Status: CLOSED NOTABUG QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 6.5 CC: acathrow, areis, bcao, bsarathy, gleb, juzhang, michen, mkenneth, qzhang, rhod, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-10-29 12:49:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Min Deng 2013-10-29 09:04:29 UTC
Description of problem:
The host crashed while multiple guests were running SVVP tests on it.

Version-Release number of selected component (if applicable):

Tree-snapshot4
kernel-2.6.32-425.el6.x86_64
qemu-kvm-qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
6 times

Steps to Reproduce:
1. Boot four guests with the following command lines:
a. /usr/libexec/qemu-kvm --nodefaults --nodefconfig -m 64G -smp 16 -cpu Nehalem -M rhel6.5.0 -usb -device usb-tablet,id=tablet0 -drive file=win2012-intel-Ml1.raw,if=none,id=drive-virtio0-0-0,format=raw,werror=stop,rerror=stop,cache=none,serial=number -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virti0-0-0,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:43:24:34:46:61 -uuid 4b610c1e-c267-4eac-929d-4a450e36443c -monitor stdio -vnc :1 -vga cirrus -name intel-ML1 -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -cdrom 1_en_windows_server_2012_x64_dvd_915478.iso -boot menu=on -device usb-ehci,id=ehci0 -drive file=usb-storage-1.raw,if=none,id=drive-usb-2-0,media=disk,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -device usb-storage,bus=ehci0.0,drive=drive-usb-2-0,id=usb-2-0,removable=on
done
b. /usr/libexec/qemu-kvm --nodefaults --nodefconfig -m 64G -smp 16 -cpu Nehalem -M rhel6.5.0 -usb -device usb-tablet,id=tablet0 -drive file=win2012-intel-Ml2.raw,if=none,id=drive-virtio0-0-0,format=raw,werror=stop,rerror=stop,cache=none,serial=number -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virti0-0-0,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:51:34:14:3f:61 -uuid 8d270f3e-5e01-42e1-b830-8c51c4fd0348 -monitor stdio -vnc :2 -vga cirrus -name intel-ML2 -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -cdrom 2_en_windows_server_2012_x64_dvd_915478.iso -boot menu=on -device usb-ehci,id=ehci0 -drive file=usb-storage-2.raw,if=none,id=drive-usb-2-0,media=disk,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -device usb-storage,bus=ehci0.0,drive=drive-usb-2-0,id=usb-2-0,removable=on
c. /usr/libexec/qemu-kvm --nodefaults --nodefconfig -m 64G -smp 16 -cpu Nehalem -M rhel6.5.0 -usb -device usb-tablet,id=tablet0 -drive file=win2012-intel-Ml3.raw,if=none,id=drive-virtio0-0-0,format=raw,werror=stop,rerror=stop,cache=none,serial=number -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virti0-0-0,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:29:14:54:36:41 -uuid 69e495d4-de62-4035-a840-3e25e25562a6 -monitor stdio -vnc :3 -vga cirrus -name Intel-ML3 -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -cdrom 3_en_windows_server_2012_x64_dvd_915478.iso -boot menu=on -device usb-ehci,id=ehci0 -drive file=usb-storage-3.raw,if=none,id=drive-usb-2-0,media=disk,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -device usb-storage,bus=ehci0.0,drive=drive-usb-2-0,id=usb-2-0,removable=on
d. /usr/libexec/qemu-kvm --nodefaults --nodefconfig -m 64G -smp 16 -cpu Nehalem -M rhel6.5.0 -usb -device usb-tablet,id=tablet0 -drive file=win2012-intel-Ml4.raw,if=none,id=drive-virtio0-0-0,format=raw,werror=stop,rerror=stop,cache=none,serial=number -device ide-drive,drive=drive-virtio0-0-0,id=virti0-0-0,bootindex=1 -netdev tap,id=hostnet0,vhost=on,script=/etc/qemu-ifup -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:28:32:11:37:41 -uuid 7b5b122b-7010-4ac7-a347-1476f51cea9c -monitor stdio -vnc :4 -vga cirrus -name Intel-ML4 -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -cdrom 4_en_windows_server_2012_x64_dvd_915478.iso -boot menu=on -device usb-ehci,id=ehci0 -drive file=usb-storage-4.raw,if=none,id=drive-usb-2-0,media=disk,format=raw,cache=none,werror=stop,rerror=stop,aio=threads -device usb-storage,bus=ehci0.0,drive=drive-usb-2-0,id=usb-2-0,removable=on
2. Run SVVP tests on all four guests at the same time, for example:
   Disk stress
   Sleep with io


Actual results:
The host crashed.

Expected results:
The test can complete successfully.

Additional info:
Guest format: RAW
Size: 120G
All the guest images are stored on SCSI storage.

Comment 4 Min Deng 2013-10-29 09:32:13 UTC
Assigning it to the kernel component for now; feel free to change it if that is wrong. Thanks.

Comment 5 Gleb Natapov 2013-10-29 12:49:59 UTC
Memory is faulty:

<4>{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
<4>{1}[Hardware Error]: APEI generic hardware error status
<4>{1}[Hardware Error]: severity: 2, corrected
<4>{1}[Hardware Error]: section: 0, severity: 2, corrected
<4>{1}[Hardware Error]: flags: 0x01
<4>{1}[Hardware Error]: primary
<4>{1}[Hardware Error]: section_type: memory error
<4>{1}[Hardware Error]: error_status: 0x0000000000000004
<4>{1}[Hardware Error]: physical_address: 0x0000000001a84380
<4>{1}[Hardware Error]: node: 1
<4>{1}[Hardware Error]: card: 2
<4>{1}[Hardware Error]: module: 1
<4>{1}[Hardware Error]: bank: 0
<4>{1}[Hardware Error]: row: 384
<4>{1}[Hardware Error]: column: 164
<4>{1}[Hardware Error]: error_type: 2, single-bit ECC
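When a log contains many corrected-error records like this one, it can help to tally them by DIMM location to see whether the errors cluster on particular modules. The following is a minimal Python sketch, not a tool that was used in this bug. The function names and the regular expression are illustrative assumptions, and the sample text is abridged from the records quoted in comments 5 and 6. It also assumes that the {N} sequence number uniquely identifies a record within one log.

```python
import re
from collections import Counter

# Matches one "key: value" field line of an APEI record, for example
# "<4>{68}[Hardware Error]: node: 2" gives seq "68", key "node", value "2".
# Lines without a leading "key:" token (such as "primary") are skipped.
FIELD_RE = re.compile(r"^<\d+>\{(\d+)\}\[Hardware Error\]:\s*(\w+):\s*(.+?)\s*$")


def parse_apei_records(log_text):
    """Group field lines by the {N} record sequence number."""
    records = {}
    for line in log_text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            seq, key, value = match.groups()
            records.setdefault(int(seq), {})[key] = value
    return records


def count_by_dimm(records):
    """Count memory-error records per (node, card, module) location."""
    counts = Counter()
    for rec in records.values():
        if rec.get("section_type") == "memory error":
            counts[(rec.get("node"), rec.get("card"), rec.get("module"))] += 1
    return counts


# Abridged from the two records quoted in this bug.
SAMPLE = """\
<4>{1}[Hardware Error]: section_type: memory error
<4>{1}[Hardware Error]: physical_address: 0x0000000001a84380
<4>{1}[Hardware Error]: node: 1
<4>{1}[Hardware Error]: card: 2
<4>{1}[Hardware Error]: module: 1
<4>{68}[Hardware Error]: section_type: memory error
<4>{68}[Hardware Error]: physical_address: 0x0000004053529d00
<4>{68}[Hardware Error]: node: 2
<4>{68}[Hardware Error]: card: 4
<4>{68}[Hardware Error]: module: 2
"""

for location, count in sorted(count_by_dimm(parse_apei_records(SAMPLE)).items()):
    print(location, count)
```

To run it against a full log, pass the log text to parse_apei_records() in place of SAMPLE.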

Comment 6 Gerd Hoffmann 2013-10-29 13:07:23 UTC
Looks like the machine has hardware problems.

The log in the vmcore (comment #2 tarball) has plenty of these:

<4>{68}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 229
<4>{68}[Hardware Error]: APEI generic hardware error status
<4>{68}[Hardware Error]: severity: 2, corrected
<4>{68}[Hardware Error]: section: 0, severity: 2, corrected
<4>{68}[Hardware Error]: flags: 0x01
<4>{68}[Hardware Error]: primary
<4>{68}[Hardware Error]: section_type: memory error
<4>{68}[Hardware Error]: error_status: 0x0000000000000004
<4>{68}[Hardware Error]: physical_address: 0x0000004053529d00
<4>{68}[Hardware Error]: node: 2
<4>{68}[Hardware Error]: card: 4
<4>{68}[Hardware Error]: module: 2
<4>{68}[Hardware Error]: bank: 3
<4>{68}[Hardware Error]: device: 4
<4>{68}[Hardware Error]: row: 21375
<4>{68}[Hardware Error]: column: 328
<4>{68}[Hardware Error]: error_type: 2, single-bit ECC

Finally this one (which is where the machine panics):

<0>[Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 5: fa00000000400405
<0>[Hardware Error]: TSC 40f9833b5ea MISC 4200 
<0>[Hardware Error]: PROCESSOR 0:206e6 TIME 1382931845 SOCKET 2 APIC 41
<0>[Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: fa00000000400405
<0>[Hardware Error]: RIP !INEXACT! 10:<ffffffff812e0f91> {intel_idle+0xb1/0x170}
<0>[Hardware Error]: TSC 40f9833afd6 MISC 4200 
<0>[Hardware Error]: PROCESSOR 0:206e6 TIME 1382931845 SOCKET 2 APIC 40
<0>[Hardware Error]: Machine check: Processor context corrupt
<0>Kernel panic - not syncing: Fatal Machine check
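The "Bank 5: fa00000000400405" value in the machine check above is the bank's IA32_MCi_STATUS register. As a rough illustration of why the kernel reports "Processor context corrupt", the sketch below decodes the architectural flag bits of that value in Python. The bit positions follow the Intel SDM machine-check chapter; the function and dictionary names are illustrative assumptions, and the sketch does not interpret the model-specific fields.

```python
# Architectural flag bits in the upper part of IA32_MCi_STATUS.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register is valid
    58: "ADDRV",  # MCi_ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status):
    """Return the set flag names and the two 16-bit error code fields."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


decoded = decode_mci_status(0xFA00000000400405)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]), hex(decoded["model_specific_code"]))
```

For this value the UC and PCC flags are set, which matches the fatal "Processor context corrupt" message, and MISCV is set while ADDRV is clear, which matches the log printing a MISC value but no ADDR value.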