Description of problem:
Random KVM guests crash or freeze when doing live migrations to/from this node. The guests have hugepages memoryBacking enabled. Sometimes the netconsole of a guest shows a kernel oops "Bad page state in process" or "list_del corruption. prev->next should be ..."

Version-Release number of selected component (if applicable):
kmod-kvm >= 83-262.el5_9.3 (RHEL 5.9 after RHSA-2013:0727, security advisory issued 2013-04-09)

How reproducible:
Set up 6 identical RHEL5 i686 guests with hugepages memoryBacking and machine type >= rhel5.5.0. Run them on one KVM host. After <= 50 live migrations of one guest to a second host and back, at least one of the (not migrated) guests freezes or crashes.
Crash/freeze means the guest does not respond to ping packets and does not respond to input via the console (virsh console guest0x).

Guest workload: memtester-4.0.8-2.el5
Guest kernel:   kernel-PAE-2.6.18-194.el5.i686 (RHEL5.5) and kernel-2.6.18-194.el5.i686 (RHEL5.5)
KVM host:       RHEL5.10 x86_64
                kernel 2.6.18-371.el5
                kvm-83-262.el5_9.4
                kmod-kvm-83-262.el5_9.4

Steps to Reproduce:
1. Start 6 RHEL5.5 guests with the 2.6.18-194 kernel and machine type rhel5.5.0 on one hypervisor node

   $ for i in {1..6}; do virsh start guest0$i; sleep .1; done

   ( $ /usr/libexec/qemu-kvm -S -M rhel5.5.0 -m 512 -mem-prealloc \
         -mem-path /hugepages/libvirt/qemu \
         -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -boot c \
         -drive file=/dev/mapper/mp_kvm04,if=virtio,boot=on,format=raw,cache=none \
         -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw \
         -net nic,macaddr=52:54:00:87:7b:c8,vlan=0,model=virtio \
         -net tap,fd=17,vlan=0 -serial pty -parallel none -usb \
         -vnc 127.0.0.1:3 -k de -vga cirrus \
         -incoming tcp:0.0.0.0:49183 -balloon virtio )

2. Start the memtester workload for each guest CPU

   $ for i in {1..6}; do ssh root@guest0$i "hostname; memtester 100M >&/dev/null <&- &"; done

3. Do live migrations of one guest until guests crash (or freeze)

   #!/bin/bash
   thishost="hyp01"
   otherhost="hyp02"
   VM="guest04"
   iter=0
   while :; do
       echo
       echo "$(date --iso-8601=sec): $iter"
       # migrate the guest away to the second host
       echo -n " -->"
       ssh -S none root@$thishost "virsh -c qemu+ssh://$thishost/system migrate --live $VM qemu+ssh://$otherhost/system"
       sleep 1
       # migrate the guest back
       echo -n "<--"
       ssh -S none root@$thishost "virsh -c qemu+ssh://$otherhost/system migrate --live $VM qemu+ssh://$thishost/system"
       sleep 1
       # a guest that no longer answers ping counts as crashed
       for i in {1..6}; do
           ping -n -q -c1 -w4 guest0$i >/dev/null || echo "guest0$i crashed"
       done
       let iter++
   done

Actual results:
...
2013-10-10T17:38:53+0200: 19
 --> <--
2013-10-10T17:39:13+0200: 20
 --> <-- guest02 crashed
2013-10-10T17:39:36+0200: 21
 -->
...

Expected results:
More than 100 migrations without a crash (most of the time the first guests fail after 3-7 migrations).

Additional info:
- Guests are stable again when the MSR_KVM(_SYSTEM_TIME) patches are removed from kmod-kvm >= 83-262.el5_9.3 as included with RHSA-2013:0727:
    Patch1656: kvm-kernel-KVM-Fix-for-buffer-overflow-in-handling-of-MSR_KVM_S.patch
    Patch1657: kvm-kernel-KVM-Convert-MSR_KVM_SYSTEM_TIME-to-use-kvm_write_gue.patch
    Patch1659: kvm-kernel-do-not-GP-on-unaligned-MSR_KVM_SYSTEM_TIME-write.patch
    Patch1660: kvm-kernel-kvm-accept-unaligned-MSR_KVM_SYSTEM_TIME-writes.patch
- Guests crash only if they are backed by hugepages memory
- Guests crash only if the qemu-kvm machine type is >= rhel5.5.0
- The migrated guest never crashes, only the other guests on the KVM host do
- It does not matter whether the guest is migrated away from or back to the KVM host.
- Guest kernel matrix:

    OS       Kernel              Status
    RHEL5.3  2.6.18-128.el5      OK
    RHEL5.3  2.6.18-128.el5PAE   OK
    RHEL5.4  2.6.18-164.el5      OK
    RHEL5.4  2.6.18-164.el5PAE   OK
    RHEL5.5  2.6.18-194.el5      CRASH
    RHEL5.5  2.6.18-194.el5PAE   CRASH
    RHEL5.6  2.6.18-238.el5      OK
    RHEL5.6  2.6.18-238.el5PAE   OK
    RHEL5.7  2.6.18-274.el5      OK
    RHEL5.7  2.6.18-274.el5PAE   OK
    RHEL5.8  2.6.18-308.el5      OK
    RHEL5.8  2.6.18-308.el5PAE   OK
    RHEL5.8  2.6.18-348.el5      OK
    RHEL5.8  2.6.18-348.el5PAE   OK (one guest crashed after 350 migrations)

- Hardware platform is HP DL380G6 and DL380pGen8 servers running Xeon X5570/E5-2660 CPUs and current BIOS, including the fix for "Virtual-APIC Page Accesses With 32-Bit PAE Paging May Cause a System Crash"
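
To compare runs against the expected result (more than 100 migrations without a crash), the output of the migration loop from step 3 can be summarized per guest. The helper below is only a sketch and not part of the original reproducer: the function name first_crashes is made up for illustration, and it assumes the log is the raw loop output, with each "<timestamp>: <iteration>" header on its own line and guest names of the form guest0N as used above.

```shell
#!/bin/bash
# Sketch: report, per guest, the first migration iteration in which the
# reproduction loop printed "<guest> crashed".
# Assumptions (not from the report): the function name first_crashes, and
# that stdin is the raw loop output with one iteration header per line.
first_crashes() {
    awk '
        # Iteration header, e.g. "2013-10-10T17:39:13+0200: 20"
        /^[0-9][0-9-]*T[0-9:+-]*: [0-9]+$/ { iter = $NF; next }
        {
            line = $0
            # A line may carry a crash report after the "-->" / "<--" markers
            while (match(line, /guest[0-9]+ crashed/)) {
                name = substr(line, RSTART, RLENGTH - 8)  # drop " crashed"
                # Remember only the first iteration seen for each guest
                if (!(name in first)) { first[name] = iter; order[++n] = name }
                line = substr(line, RSTART + RLENGTH)
            }
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s first reported crashed in iteration %s\n", order[i], first[order[i]]
        }
    '
}
```

If the loop output was saved to a file (say migrate.log, a hypothetical name), "first_crashes < migrate.log" prints one line per guest that stopped answering ping. A guest that stays down is reported again in every later iteration by the loop, so only its first occurrence is kept.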