Bug 1021942 - Guest crash and freezes with hugepages and kmod-kvm-83-262.el5_9.4
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Andrea Arcangeli
QA Contact: Virtualization Bugs
Depends On:
Blocks: 1049888
Reported: 2013-10-22 07:32 EDT by Roland Friedwagner
Modified: 2014-02-04 10:12 EST

Doc Type: Bug Fix
Last Closed: 2013-12-01 04:22:18 EST
Type: Bug

Attachments: None
Description Roland Friedwagner 2013-10-22 07:32:11 EDT
Description of problem:

KVM guests randomly crash or freeze while live migrations run to or from this node.
The guests have hugepages memoryBacking enabled.
Sometimes the guest's netconsole shows a kernel oops:
"Bad page state in process" or
"list_del corruption. prev->next should be ..."

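To spot the two oops signatures quoted above in captured guest console output, a small filter along these lines may help. This is only a sketch: the function name scan_guest_oops and the log path in the usage line are made up for illustration, and it assumes the netconsole output has already been saved to a file.

```shell
# Hypothetical helper: reads captured guest console output on stdin and
# prints every line that contains one of the two oops signatures from
# this report, prefixed with its line number (grep -n).
# Exit status is non-zero when neither signature is found.
scan_guest_oops() {
    grep -n -E 'Bad page state in process|list_del corruption'
}
```

Usage (path is an example only): scan_guest_oops < /var/log/netconsole/guest02.log
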
Version-Release number of selected component (if applicable):

kmod-kvm >= 83-262.el5_9.3 
(RHEL 5.9 after RHSA-2013:0727 - Security Advisory issued 2013-04-09)

How reproducible:

Set up 6 identical RHEL5 i686 guests with hugepages memoryBacking and
machine type >= rhel5.5.0. Run them on one KVM host.
After <= 50 live migrations of one guest to a second host and back,
at least one of the other (not migrated) guests freezes or crashes.
Crash/freeze means the guest does not respond to ping packets and
does not respond to input via the console (virsh console guest0x).
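Since the problem only shows up with hugepages backing and machine type >= rhel5.5.0, it can be useful to confirm both settings per guest before starting a run. The sketch below reads libvirt domain XML on stdin and reports them; the function name check_domain_xml is hypothetical, and the parsing is a plain text match, not a real XML parser.

```shell
# Hypothetical helper: reads libvirt domain XML on stdin and reports
# (1) whether a <hugepages> element is present and
# (2) the value of the first machine='...' attribute found.
check_domain_xml() {
    xml=$(cat)
    if printf '%s\n' "$xml" | grep -q '<hugepages'; then
        echo "hugepages: yes"
    else
        echo "hugepages: no"
    fi
    # Extract the machine attribute (single or double quoted).
    machine=$(printf '%s\n' "$xml" \
        | sed -n "s/.*machine=['\"]\([^'\"]*\)['\"].*/\1/p" \
        | head -n1)
    echo "machine: ${machine:-unknown}"
}
```

Usage: virsh dumpxml guest01 | check_domain_xml
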

Guest workload: memtester-4.0.8-2.el5
Guest kernels: kernel-PAE-2.6.18-194.el5.i686 and
kernel-2.6.18-194.el5.i686 (both RHEL5.5)

KVM Host is RHEL5.10 x86_64
kernel 2.6.18-371.el5

Steps to Reproduce:

1. Start 6 RHEL5.5 guests with the 2.6.18-194 kernel and machine type rhel5.5.0 on one hypervisor node
   $ for i in {1..6}; do virsh start guest0$i; sleep .1; done

   $ /usr/libexec/qemu-kvm -S -M rhel5.5.0 -m 512 -mem-prealloc -mem-path /hugepages/libvirt/qemu 
     -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -boot c 
     -drive file=/dev/mapper/mp_kvm04,if=virtio,boot=on,format=raw,cache=none 
     -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw 
     -net nic,macaddr=52:54:00:87:7b:c8,vlan=0,model=virtio -net tap,fd=17,vlan=0 
     -serial pty -parallel none -usb -vnc -k de -vga cirrus 
     -incoming tcp: -balloon virtio

2. Start memtester workload for each guest CPU

   $ for i in {1..6}; do ssh root@guest0$i "hostname; memtester 100M >&/dev/null <&- &"; done

3. Run live migrations of one guest until a guest crashes (or freezes)

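   # $thishost, $otherhost and $VM must be set beforehand.
   iter=1   # starting value assumed; the original left it unset
   while :; do
       echo "$(date --iso-8601=sec): $iter"
       echo -n " -->"
       # migrate the guest away to the other host ...
       ssh -S none root@$thishost "virsh -c qemu+ssh://$thishost/system migrate --live $VM qemu+ssh://$otherhost/system"
       sleep 1
       echo -n "<--"
       # ... and back again
       ssh -S none root@$thishost "virsh -c qemu+ssh://$otherhost/system migrate --live $VM qemu+ssh://$thishost/system"
       sleep 1
       # check that every guest still answers ping
       for i in {1..6}; do
           ping -n -q -c1 -w4 guest0$i >/dev/null || echo "guest0$i crashed"
       done
       let iter++
   done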

Actual results:

2013-10-10T17:38:53+0200: 19

2013-10-10T17:39:13+0200: 20
guest02 crashed

2013-10-10T17:39:36+0200: 21
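When comparing runs, it helps to pull out the iteration at which each guest was first reported as crashed. The following sketch parses the loop's log in the format shown above; the function name first_crashes is hypothetical, and it assumes each iteration starts with a "timestamp: N" line and that crash lines end in "guestNN crashed".

```shell
# Hypothetical helper: reads the migration loop's log on stdin and prints,
# for each guest, the iteration at which it was first reported as crashed.
first_crashes() {
    awk '
        # Iteration header, e.g. "2013-10-10T17:39:13+0200: 20"
        /^[0-9-]+T[0-9:+-]+: [0-9]+$/ { iter = $NF; next }
        # Crash report, possibly preceded by the "--><--" progress markers
        match($0, /guest[0-9]+ crashed$/) {
            guest = substr($0, RSTART, RLENGTH - length(" crashed"))
            if (!(guest in seen)) {
                seen[guest] = 1
                print guest " first reported at iteration " iter
            }
        }
    '
}
```

Usage (log file name is an example only): first_crashes < migrate.log
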

Expected results:

More than 100 migrations without a crash
(in most runs the first guest fails after 3-7 migrations)

Additional info:

- Guests are stable again when the MSR_KVM(_SYSTEM_TIME) patches
  included in kmod-kvm >= 83-262.el5_9.3 with RHSA-2013:0727 are removed:

  Patch1656: kvm-kernel-KVM-Fix-for-buffer-overflow-in-handling-of-MSR_KVM_S.patch
  Patch1657: kvm-kernel-KVM-Convert-MSR_KVM_SYSTEM_TIME-to-use-kvm_write_gue.patch
  Patch1659: kvm-kernel-do-not-GP-on-unaligned-MSR_KVM_SYSTEM_TIME-write.patch
  Patch1660: kvm-kernel-kvm-accept-unaligned-MSR_KVM_SYSTEM_TIME-writes.patch

- Guests crash only if they are backed by hugepages memory

- Guests crash only if the qemu-kvm machine type is >= rhel5.5.0

- The migrated guest never crashes; only the other guests on the KVM host do

- It does not matter whether the guest is migrating away from or back to the KVM host.

- Guest kernel matrix:

  OS       Kernel              Status

  RHEL5.3  2.6.18-128.el5      OK
  RHEL5.3  2.6.18-128.el5PAE   OK
  RHEL5.4  2.6.18-164.el5      OK
  RHEL5.4  2.6.18-164.el5PAE   OK 
  RHEL5.5  2.6.18-194.el5      CRASH 
  RHEL5.5  2.6.18-194.el5PAE   CRASH 
  RHEL5.6  2.6.18-238.el5      OK
  RHEL5.6  2.6.18-238.el5PAE   OK
  RHEL5.7  2.6.18-274.el5      OK
  RHEL5.7  2.6.18-274.el5PAE   OK
  RHEL5.8  2.6.18-308.el5      OK
  RHEL5.8  2.6.18-308.el5PAE   OK
  RHEL5.9  2.6.18-348.el5      OK
  RHEL5.9  2.6.18-348.el5PAE   OK (one guest crashed after 350 migrations)

- Hardware platform is HP DL380 G6 and DL380p Gen8 servers
  running Xeon X5570/E5-2660 CPUs with current BIOS, including the fix for
  "Virtual-APIC Page Accesses With 32-Bit PAE Paging may Cause a System Crash"
