| Summary: | Guest crash and freezes with hugepages and kmod-kvm-83-262.el5_9.4 | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Roland Friedwagner <roland.friedwagner> |
| Component: | kvm | Assignee: | Andrea Arcangeli <aarcange> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 5.10 | CC: | chayang, juzhang, lmiksik, michen, mkenneth, qzhang, rhod, roland.friedwagner, virt-maint, ypu |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-12-01 09:22:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1049888 | ||
Description of problem:

Random KVM guest crashes or freezes when doing live migration to/from this node. The guests have hugepages memoryBacking enabled. Sometimes the guest's netconsole shows a kernel oops "Bad page state in process" or "list_del corruption. prev->next should be ..."

Version-Release number of selected component (if applicable):

kmod-kvm >= 83-262.el5_9.3 (RHEL 5.9 after RHSA-2013:0727 - Security Advisory issued 2013-04-09)

How reproducible:

Set up 6 identical RHEL5 i686 guests with hugepages memoryBacking and machine type >= rhel5.5.0. Run them on one KVM host. After <= 50 live migrations of one guest to a second host and back, at least one of the (not migrated) guests freezes or crashes. Crash/freeze means the guest does not respond to ping packets and does not respond to input via the console (virsh console guest0x).

Guest workload: memtester-4.0.8-2.el5
Guest kernels: kernel-PAE-2.6.18-194.el5.i686 (RHEL5.5) and kernel-2.6.18-194.el5.i686 (RHEL5.5)

KVM host: RHEL5.10 x86_64
kernel 2.6.18-371.el5
kvm-83-262.el5_9.4
kmod-kvm-83-262.el5_9.4

Steps to Reproduce:

1. Start 6 rhel5.5 guests with the 2.6.18-194 kernel and machine type rhel5.5.0 on one hypervisor node

    $ for i in {1..6}; do virsh start guest0$i; sleep .1; done

    ( $ /usr/libexec/qemu-kvm -S -M rhel5.5.0 -m 512 -mem-prealloc \
        -mem-path /hugepages/libvirt/qemu -smp 2,sockets=2,cores=1,threads=1 \
        -no-kvm-pit-reinjection -boot c \
        -drive file=/dev/mapper/mp_kvm04,if=virtio,boot=on,format=raw,cache=none \
        -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw \
        -net nic,macaddr=52:54:00:87:7b:c8,vlan=0,model=virtio -net tap,fd=17,vlan=0 \
        -serial pty -parallel none -usb -vnc 127.0.0.1:3 -k de -vga cirrus \
        -incoming tcp:0.0.0.0:49183 -balloon virtio )

2. Start memtester workload for each guest CPU

    $ for i in {1..6}; do ssh root@guest0$i "hostname; memtester 100M >&/dev/null <&- &"; done

3. Do live migrations of one guest until a guest crashes (or freezes)

    #!/bin/bash
    thishost="hyp01"
    otherhost="hyp02"
    VM="guest04"
    iter=0
    while :; do
        echo
        echo "$(date --iso-8601=sec): $iter"
        echo -n " -->"
        ssh -S none root@$thishost "virsh -c qemu+ssh://$thishost/system migrate --live $VM qemu+ssh://$otherhost/system"
        sleep 1
        echo -n "<--"
        ssh -S none root@$thishost "virsh -c qemu+ssh://$otherhost/system migrate --live $VM qemu+ssh://$thishost/system"
        sleep 1
        for i in {1..6}; do
            ping -n -q -c1 -w4 guest0$i >/dev/null || echo "guest0$i crashed"
        done
        let iter++
    done

Actual results:

    ...
    2013-10-10T17:38:53+0200: 19
     --> <--
    2013-10-10T17:39:13+0200: 20
     --> <--
    guest02 crashed
    2013-10-10T17:39:36+0200: 21
     --> ...

Expected results:

More than 100 migrations without a crash (most of the time the first guest fails after 3-7 migrations).

Additional info:

- Guests are stable again when the MSR_KVM(_SYSTEM_TIME) patches are removed from kmod-kvm >= 83-262.el5_9.3, as included with RHSA-2013:0727:

    Patch1656: kvm-kernel-KVM-Fix-for-buffer-overflow-in-handling-of-MSR_KVM_S.patch
    Patch1657: kvm-kernel-KVM-Convert-MSR_KVM_SYSTEM_TIME-to-use-kvm_write_gue.patch
    Patch1659: kvm-kernel-do-not-GP-on-unaligned-MSR_KVM_SYSTEM_TIME-write.patch
    Patch1660: kvm-kernel-kvm-accept-unaligned-MSR_KVM_SYSTEM_TIME-writes.patch

- Guests crash only if they are backed by hugepages memory
- Guests crash only if the qemu-kvm machine type is >= rhel5.5.0
- The migrated guest never crashes, only the other guests on the KVM host do
- It does not matter whether the guest is migrated away from or back to the KVM host.
- Guest kernel matrix:

| OS | Kernel | Status |
|---|---|---|
| RHEL5.3 | 2.6.18-128.el5 | OK |
| RHEL5.3 | 2.6.18-128.el5PAE | OK |
| RHEL5.4 | 2.6.18-164.el5 | OK |
| RHEL5.4 | 2.6.18-164.el5PAE | OK |
| RHEL5.5 | 2.6.18-194.el5 | CRASH |
| RHEL5.5 | 2.6.18-194.el5PAE | CRASH |
| RHEL5.6 | 2.6.18-238.el5 | OK |
| RHEL5.6 | 2.6.18-238.el5PAE | OK |
| RHEL5.7 | 2.6.18-274.el5 | OK |
| RHEL5.7 | 2.6.18-274.el5PAE | OK |
| RHEL5.8 | 2.6.18-308.el5 | OK |
| RHEL5.8 | 2.6.18-308.el5PAE | OK |
| RHEL5.8 | 2.6.18-348.el5 | OK |
| RHEL5.8 | 2.6.18-348.el5PAE | OK (one guest crashed after 350 migrations) |

- Hardware platform is HP DL380G6 and DL380pGen8 servers running Xeon X5570/E5-2660 CPUs and current BIOS, including the fix for "Virtual-APIC Page Accesses With 32-Bit PAE Paging may Cause a System Crash"
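
The reproduction loop in step 3 runs forever and only prints a line when a guest stops answering ping, so comparing it against the "more than 100 migrations" expectation means counting iterations by hand. A bounded variant might look like the sketch below. It reuses the same `virsh migrate --live` and `ping` invocations as the report. The function names (`migrate`, `alive`, `run_stress`), the `MAX_ITER`, `PAUSE` and `GUESTS` variables, and the exit codes are assumptions introduced here for illustration and are not part of the original test setup.

```shell
#!/bin/bash
# Sketch only: bounded version of the step 3 migration loop.
# All variable and function names below are illustrative assumptions.

THISHOST="${THISHOST:-hyp01}"      # host that runs the six guests
OTHERHOST="${OTHERHOST:-hyp02}"    # migration target
VM="${VM:-guest04}"                # guest that is migrated back and forth
GUESTS="${GUESTS:-guest01 guest02 guest03 guest04 guest05 guest06}"
MAX_ITER="${MAX_ITER:-100}"        # number of round trips to attempt
PAUSE="${PAUSE:-1}"                # seconds to wait after each migration

# migrate SRC DST: live-migrate $VM from SRC to DST, using the same
# virsh call as the script in the report.
migrate() {
    ssh -S none "root@$THISHOST" \
        "virsh -c qemu+ssh://$1/system migrate --live $VM qemu+ssh://$2/system"
}

# alive GUEST: succeeds if GUEST answers one ping within 4 seconds.
alive() {
    ping -n -q -c1 -w4 "$1" >/dev/null 2>&1
}

# run_stress: returns 0 if every guest survives MAX_ITER round trips,
# 1 on the first unresponsive guest, 2 if a migration command fails.
run_stress() {
    iter=0
    while [ "$iter" -lt "$MAX_ITER" ]; do
        iter=$((iter + 1))
        echo "$(date '+%Y-%m-%dT%H:%M:%S%z'): round trip $iter" >&2
        if ! migrate "$THISHOST" "$OTHERHOST"; then
            echo "migration to $OTHERHOST failed in round trip $iter"
            return 2
        fi
        sleep "$PAUSE"
        if ! migrate "$OTHERHOST" "$THISHOST"; then
            echo "migration to $THISHOST failed in round trip $iter"
            return 2
        fi
        sleep "$PAUSE"
        for guest in $GUESTS; do
            if ! alive "$guest"; then
                echo "$guest unresponsive after $iter round trips"
                return 1
            fi
        done
    done
    echo "no guest failure in $MAX_ITER round trips"
    return 0
}
```

Calling `run_stress` after the definitions starts the test; progress lines go to stderr and the single result line goes to stdout, so the outcome can be captured or logged separately. Like the original script, the sketch treats a missed ping as a crash or freeze and does not inspect the guest console.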