Bug 1021942 - Guest crash and freezes with hugepages and kmod-kvm-83-262.el5_9.4
Summary: Guest crash and freezes with hugepages and kmod-kvm-83-262.el5_9.4
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Andrea Arcangeli
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: 1049888
 
Reported: 2013-10-22 11:32 UTC by Roland Friedwagner
Modified: 2018-12-03 20:24 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-12-01 09:22:18 UTC
Target Upstream Version:
Embargoed:



Description Roland Friedwagner 2013-10-22 11:32:11 UTC
Description of problem:

KVM guests randomly crash or freeze during live migrations to or from this node.
The guests have hugepages memoryBacking enabled.
Sometimes the netconsole of a guest shows a kernel oops:
"Bad page state in process" or
"list_del corruption. prev->next should be ..."
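
Since the problem only shows up with hugepage-backed guests, it can help to confirm which guests actually have that setting. The following is a minimal sketch, not part of the original report: uses_hugepages is a hypothetical helper name, and it assumes the libvirt domain XML contains a hugepages element only inside memoryBacking.

```shell
#!/bin/bash
# Sketch: report whether a libvirt domain XML (read from stdin) requests
# hugepage backing. uses_hugepages is a hypothetical helper name.
uses_hugepages() {
    # Matches "<hugepages/>", "<hugepages />" or "<hugepages>".
    # Assumption: the element only occurs inside <memoryBacking>.
    grep -q '<hugepages[ />]'
}

# Usage on the hypervisor (requires libvirt):
#   virsh dumpxml guest01 | uses_hugepages && echo "guest01: hugepages"
```

The function only inspects the XML text, so it can also be run against saved domain definitions.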


Version-Release number of selected component (if applicable):

kmod-kvm >= 83-262.el5_9.3 
(RHEL 5.9 after RHSA-2013:0727 - Security Advisory issued 2013-04-09)
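
To check whether a given kmod-kvm build falls into the affected range, a rough comparison along the following lines may be enough. This is only a sketch: is_affected and FIRST_AFFECTED are hypothetical names, and it assumes that GNU "sort -V" orders these version strings the same way rpm would, which is an approximation and not a replacement for rpm's own comparison. It also needs a sort that supports -V, which the RHEL 5 host itself may not provide.

```shell
#!/bin/bash
# First kmod-kvm version-release that the report identifies as affected.
FIRST_AFFECTED="83-262.el5_9.3"

# is_affected VERSION-RELEASE
# Returns 0 if the argument sorts at or after FIRST_AFFECTED.
# Assumption: "sort -V" ordering is close enough to rpm's version
# comparison for strings of this shape.
is_affected() {
    local lowest
    lowest=$(printf '%s\n%s\n' "$FIRST_AFFECTED" "$1" | sort -V | head -n1)
    [ "$lowest" = "$FIRST_AFFECTED" ]
}

# Usage (on a machine with rpm and the package installed):
#   is_affected "$(rpm -q --qf '%{VERSION}-%{RELEASE}' kmod-kvm)" && echo affected
```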


How reproducible:

Set up 6 identical RHEL5 i686 guests with hugepages memoryBacking and
machine type >= rhel5.5.0. Run them on one KVM host.
After <= 50 live migrations of one guest to a second host and back,
at least one of the other (not migrated) guests freezes or crashes.
Crash/freeze means the guest does not respond to ping packets and
does not respond to input via the console (virsh console guest0x).

Guest workload is memtester-4.0.8-2.el5.
Guest kernels are kernel-PAE-2.6.18-194.el5.i686 (RHEL5.5) and
kernel-2.6.18-194.el5.i686 (RHEL5.5).

KVM Host is RHEL5.10 x86_64
kernel 2.6.18-371.el5
kvm-83-262.el5_9.4
kmod-kvm-83-262.el5_9.4

Steps to Reproduce:

1. Start 6 RHEL5.5 guests with the 2.6.18-194 kernel and machine type rhel5.5.0 on one hypervisor node
   
   $ for i in {1..6}; do virsh start guest0$i; sleep .1; done

   (
   $ /usr/libexec/qemu-kvm -S -M rhel5.5.0 -m 512 -mem-prealloc -mem-path /hugepages/libvirt/qemu 
     -smp 2,sockets=2,cores=1,threads=1 -no-kvm-pit-reinjection -boot c 
     -drive file=/dev/mapper/mp_kvm04,if=virtio,boot=on,format=raw,cache=none 
     -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw 
     -net nic,macaddr=52:54:00:87:7b:c8,vlan=0,model=virtio -net tap,fd=17,vlan=0 
     -serial pty -parallel none -usb -vnc 127.0.0.1:3 -k de -vga cirrus 
     -incoming tcp:0.0.0.0:49183 -balloon virtio
   )

2. Start memtester workload for each guest CPU

   $ for i in {1..6}; do ssh root@guest0$i "hostname; memtester 100M >&/dev/null <&- &"; done

3. Live-migrate one guest back and forth until a guest crashes (or freezes)

   #!/bin/bash
   thishost="hyp01"
   otherhost="hyp02"
   VM="guest04"
   iter=0
   while :; do
       echo
       echo "$(date --iso-8601=sec): $iter"
       echo -n " -->"
       ssh -S none root@$thishost "virsh -c qemu+ssh://$thishost/system migrate  --live $VM qemu+ssh://$otherhost/system"
       sleep 1
       echo -n "<--"
       ssh -S none root@$thishost "virsh -c qemu+ssh://$otherhost/system migrate --live $VM qemu+ssh://$thishost/system"
       sleep 1
       for i in {1..6}; do
           ping -n -q -c1 -w4 guest0$i >/dev/null || echo "guest0$i crashed"
       done
       let iter++
   done


Actual results:

...
2013-10-10T17:38:53+0200: 19
 -->
<--

2013-10-10T17:39:13+0200: 20
 -->
<--
guest02 crashed

2013-10-10T17:39:36+0200: 21
 -->
...
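
When the loop output is captured to a file, the first failure can be pulled out automatically instead of being read off by eye. The following is a minimal sketch that assumes the log has exactly the shape shown above (a "<timestamp>: <iteration>" header per round, and "guestNN crashed" lines); first_crash and the file names in the usage comment are hypothetical.

```shell
#!/bin/bash
# first_crash: read the migration loop output on stdin and print the
# iteration and guest of the first reported crash.
first_crash() {
    awk '
        # Header line written by the loop: "<ISO timestamp>: <iteration>"
        /^[0-9]+-[0-9]+-[0-9]+T.*: [0-9]+$/ { iter = $NF }
        # Line written by the ping check: "<guest> crashed"
        / crashed$/ { print "iteration " iter ": " $1; found = 1; exit }
        END { if (!found) print "no crash recorded" }
    '
}

# Usage (hypothetical file names):
#   ./migrate-loop.sh | tee migrate.log
#   first_crash < migrate.log
```

For the excerpt above this would report iteration 20 and guest02.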


Expected results:

More than 100 migrations without a crash
(most of the time the first guest fails after 3-7 migrations)


Additional info:

- Guests are stable again when the MSR_KVM(_SYSTEM_TIME) patches
  included in kmod-kvm >= 83-262.el5_9.3 with RHSA-2013:0727 are removed:

  Patch1656: kvm-kernel-KVM-Fix-for-buffer-overflow-in-handling-of-MSR_KVM_S.patch
  Patch1657: kvm-kernel-KVM-Convert-MSR_KVM_SYSTEM_TIME-to-use-kvm_write_gue.patch
  Patch1659: kvm-kernel-do-not-GP-on-unaligned-MSR_KVM_SYSTEM_TIME-write.patch
  Patch1660: kvm-kernel-kvm-accept-unaligned-MSR_KVM_SYSTEM_TIME-writes.patch

- Guests crash only if they are backed by hugepages memory

- Guests crash only if the qemu-kvm machine type is >= rhel5.5.0

- The migrated guest never crashes; only the other guests on the KVM host do

- It does not matter if migrating away or migrating back to the KVM host.

- Guest kernel matrix:

  OS       Kernel              Status

  RHEL5.3  2.6.18-128.el5      OK
  RHEL5.3  2.6.18-128.el5PAE   OK
  RHEL5.4  2.6.18-164.el5      OK
  RHEL5.4  2.6.18-164.el5PAE   OK 
  RHEL5.5  2.6.18-194.el5      CRASH 
  RHEL5.5  2.6.18-194.el5PAE   CRASH 
  RHEL5.6  2.6.18-238.el5      OK
  RHEL5.6  2.6.18-238.el5PAE   OK
  RHEL5.7  2.6.18-274.el5      OK
  RHEL5.7  2.6.18-274.el5PAE   OK
  RHEL5.8  2.6.18-308.el5      OK
  RHEL5.8  2.6.18-308.el5PAE   OK
  RHEL5.9  2.6.18-348.el5      OK
  RHEL5.9  2.6.18-348.el5PAE   OK (one guest crashed after 350 migrations)

- Hardware platform is HP DL380G6 and DL380pGen8 servers
  running Xeon X5570/E5-2660 CPUs with current BIOS, including the fix for
  "Virtual-APIC Page Accesses With 32-Bit PAE Paging May Cause a System Crash"

