Bug 802118 - KVM guest OS stops responding
Summary: KVM guest OS stops responding
Keywords:
Status: CLOSED DUPLICATE of bug 782631
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Karen Noel
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-03-11 12:51 UTC by Johnny Hughes
Modified: 2013-01-10 00:46 UTC
CC List: 21 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-03-20 07:46:21 UTC
Target Upstream Version:
Embargoed:



Description Johnny Hughes 2012-03-11 12:51:27 UTC
Description of problem:  

With the latest RHEL 5.8 KVM and kmod-kvm (kvm-83-249.el5), the guest (in this case also a RHEL 5.8 machine) becomes unresponsive after a period of time. No logins are possible via the console or SSH; login attempts simply hang.

This issue seems to happen only on machines using the virtio disk driver.
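
For reference, a quick way to tell whether a guest is using the virtio disk driver (a minimal sketch, run inside the guest; virtio disks show up as vdX, IDE disks as hdX):

  # check the guest's block device names for virtio disks
  grep 'vd[a-z]' /proc/partitions && echo "virtio disk driver in use"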

This issue is also discussed on the RHEL5 mailing list:
https://www.redhat.com/archives/rhelv5-list/2012-February/msg00060.html
https://www.redhat.com/archives/rhelv5-list/2012-March/msg00016.html
https://www.redhat.com/archives/rhelv5-list/2012-March/msg00017.html


This issue is also present in CentOS-5.8 as detailed in this bug:
http://bugs.centos.org/view.php?id=5582

It is also discussed on the CentOS mailing list in this thread:
http://lists.centos.org/pipermail/centos/2012-March/124043.html


Version-Release number of selected component (if applicable):
kvm-83-249.el5


How reproducible:
Always - after 1-3 days

Steps to Reproduce:
1.  Install the new kvm and kmod-kvm on a host server with virtio disks, then wait a period of time (in my case, 1-3 days). A quick package check is sketched below.
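
To confirm the package versions on the host first, something like the following (a sketch; the report above is against kvm-83-249.el5, and kmod-kvm normally carries the same version):

  # query the installed kvm userspace and kernel-module packages
  rpm -q kvm kmod-kvm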

Comment 4 Ronen Hod 2012-03-12 15:25:42 UTC
Dear Johnny Hughes,

Thank you for taking the time to enter a bug report with us. We do appreciate the feedback and look to use reports such as this to guide our efforts at improving our products. That being said, this bug tracking system is not a mechanism for getting support, and as such we are not able to make any guarantees as to the timeliness or suitability of a resolution.
 
If this issue is critical or in any way time sensitive, please raise a ticket through your regular Red Hat support channels to make certain that it gets the proper attention and prioritization to assure a timely resolution. 
 
For information on how to contact the Red Hat production support team, please see:
https://www.redhat.com/support/process/production/#howto

Thanks.

Comment 5 Suqin Huang 2012-03-13 07:03:38 UTC
Cannot reproduce this bug with different host kernels and kvm versions.

nic: virtio, blk: virtio

details:

I. kvm-83-249.el5

1. host kernel: 2.6.18-308.el5
   1). run 5 guests on host ---> run 17 hours
       guest: 2.6.18-308.el5
       3 guests: keep downloading files
       2 guests: idle
   2). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours
   3). install guest in lvm --> run 12 hours
       guest: 2.6.18-308.el5
       keep dd in guest
           
2. host kernel: 2.6.18-300.el5
   1). run 6 guests on host ----> 30 hours
       guest: 2.6.18-308.el5
       2 guests: keep downloading files
       1 guest: running httpd inside with 100 concurrent connections
       3 guests: idle

3. host kernel: 2.6.18-274.18.1.el5
   1). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours

4. host kernel: 2.6.18-305.el5
   1). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours

II. kvm-83-246.el5

1. host kernel: 2.6.18-305.el5
   1). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours

III. kvm-83-239.el5

1. host kernel: 2.6.18-274.17.1.el5
   1). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours


cmd: /usr/libexec/qemu-kvm -name rhel5.8 -monitor stdio -serial unix:/tmp/serial-20120313-002624-AuCO,server,nowait -drive file=/home/RHEL-Server-5.8-64-virtio.qcow2,index=0,if=virtio,media=disk,cache=none,boot=on,format=qcow2 -net nic,vlan=0,model=virtio,macaddr=a0:01:8a:76:75:00 -net tap,vlan=0,script=/etc/qemu-ifup-switch -m 4096 -smp 2,cores=1,threads=1,sockets=2 -cpu qemu64,+sse2 -soundhw ac97 -vnc :0 -rtc-td-hack -M rhel5.6.0 -boot c -no-kvm-pit-reinjection -usbdevice tablet

Comment 6 Rainer Traut 2012-03-13 08:46:13 UTC
I'm using this simpler cmdline and can reproduce with rhel5.8 release kernel on host and guest:

/usr/libexec/qemu-kvm -m 1024 -boot c -k de \
                -daemonize \
                -drive file=/dev/sdc1,if=virtio,index=0,boot=on \
                -drive file=/dev/mapper/VG00-LVvm2swap,if=virtio,index=1 \
                -net nic,vlan=0,macaddr=xxx,model=virtio \
                -net nic,vlan=1,macaddr=xxx,model=virtio 

Maybe memory related?
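
One way to check that theory (a sketch with a hypothetical log path, not something tried in this thread) would be to log host memory and swap activity until the hang occurs:

  # record memory/swap statistics every 60 seconds in the background
  vmstat 60 >> /tmp/host-memwatch.log &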

Comment 7 Suqin Huang 2012-03-13 08:56:55 UTC
(In reply to comment #6)
> I'm using this simpler cmdline and can reproduce with rhel5.8 release kernel on
> host and guest:
> 
> /usr/libexec/qemu-kvm -m 1024 -boot c -k de \
>                 -daemonize \
>                 -drive file=/dev/sdc1,if=virtio,index=0,boot=on \
>                 -drive file=/dev/mapper/VG00-LVvm2swap,if=virtio,index=1 \
>                 -net nic,vlan=0,macaddr=xxx,model=virtio \
>                 -net nic,vlan=1,macaddr=xxx,model=virtio 
> 
> Maybe memory related?

I didn't attach all of the guests' command lines; the following two scenarios were tested with 1024M of memory.

It seems you are testing with multiple block devices and NICs; I will try to test with your scenario.

I. kvm-83-249.el5

1. host kernel: 2.6.18-308.el5
   1). run 5 guests on host ---> run 17 hours
       guest: 2.6.18-308.el5
       3 guests: keep downloading files
       2 guests: idle

/usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate now -name test -smp 1,cores=1 -k en-us -m 1024 -boot dcn -net nic,vlan=1,macaddr=89:12:41:43:2c:52,model=virtio -net tap,vlan=1,ifname=virtio_10_1,script=/etc/qemu-ifup -drive file=/home/rhel5.8GA-copy-4.qcow2,media=disk,if=virtio,serial=7b-8a18-438e1f274bd2,boot=on,format=qcow2,werror=stop -soundhw ac97 -vnc :1 -vga cirrus -cpu qemu64,+sse2 -M rhel5.5.0 -notify all -vga cirrus -balloon none -monitor stdio 


2. host kernel: 2.6.18-300.el5
   1). run 6 guests on host ----> 30 hours
       guest: 2.6.18-308.el5
       2 guests: keep downloading files
       1 guest: running httpd inside with 100 concurrent connections
       3 guests: idle

/usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate now -name test -smp 1,cores=1 -k en-us -m 1024 -boot dcn -net nic,vlan=1,macaddr=83:13:35:43:a3:32,model=virtio -net tap,vlan=1,ifname=virtio_10_6,script=/etc/qemu-ifup -drive file=/home/rhel5.8GA.qcow2,media=disk,if=virtio,serial=73-ad18-438e1f2a4aa2,boot=on,format=qcow2,werror=stop -soundhw ac97 -vnc :6 -cpu qemu64,+sse2 -M rhel5.4.0 -notify all -balloon none

Comment 8 Johnny Hughes 2012-03-13 10:23:30 UTC
Another thing common to both my scenario and Rainer's: we are both using LVM on the host for our images, whereas the tests in comment #5 seem to use local qcow2 images on the host.

Comment 9 Suqin Huang 2012-03-13 10:30:01 UTC
(In reply to comment #8)
> Another thing common to both my scenario and Rainer's: we are both using LVM
> on the host for our images, whereas the tests in comment #5 seem to use local
> qcow2 images on the host.

We also tested a guest installed on LVM (a sketch of such a setup follows the list below):

I. kvm-83-249.el5

1. host kernel: 2.6.18-308.el5
   1). run 5 guests on host ---> run 17 hours
       guest: 2.6.18-308.el5
       3 guests: keep downloading files
       2 guests: idle
   2). run 1 guest on host ----> run 24 hours
       guest: 2.6.18-308.el5
       idle 19 hours, run dd 5 hours


--------------------lvm guest ------------------
   3). install guest in lvm --> run 12 hours
       guest: 2.6.18-308.el5
       keep dd in guest
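
For anyone reproducing the LVM case, a rough sketch of such a setup (volume group, LV name, and sizes are hypothetical):

  # create an LVM-backed disk and boot a guest from it over virtio
  lvcreate -L 20G -n LV_guest VG00
  /usr/libexec/qemu-kvm -m 1024 -smp 1 -boot c \
      -drive file=/dev/VG00/LV_guest,if=virtio,boot=on,format=raw,cache=none \
      -net nic,model=virtio -net user -vnc :9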

Comment 10 Johnny Hughes 2012-03-13 10:37:53 UTC
I would also point out that I have not had an issue after switching from the
virtio disk driver to the ide driver.

Does Not Work (normally crashes within 24-36 hours):
/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 \
    -smp 1,sockets=1,cores=1,threads=1 -name testbox \
    -uuid 636e9e51-f14f-b895-0753-2df877cafa8e \
    -monitor unix:/var/lib/libvirt/qemu/testbox.monitor,server,nowait \
    -no-kvm-pit-reinjection -boot c \
    -drive file=/dev/VG_VirtHosts/LV_testbox,if=virtio,boot=on,format=raw,cache=none \
    -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw \
    -net nic,macaddr=54:52:00:78:68:ee,vlan=0,model=virtio -net tap,fd=19,vlan=0 \
    -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -balloon virtio


Works (up longer than 48 hours, no issues):
/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 1024 \
    -smp 1,sockets=1,cores=1,threads=1 -name testbox \
    -uuid 636e9e51-f14f-b895-0753-2df877cafa8e \
    -monitor unix:/var/lib/libvirt/qemu/testbox.monitor,server,nowait \
    -no-kvm-pit-reinjection -boot c \
    -drive file=/dev/VG_VirtHosts/LV_testbox,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none \
    -drive if=ide,media=cdrom,bus=1,unit=0,readonly=on,format=raw \
    -net nic,macaddr=54:52:00:78:68:ee,vlan=0,model=virtio -net tap,fd=19,vlan=0 \
    -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -balloon virtio
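
For libvirt-managed guests like the one above, the same workaround can be applied in the domain XML (a sketch; "testbox" is the guest name from the command line above, and the guest must be restarted for the change to take effect):

  # switch the disk bus from virtio to ide in the domain definition
  virsh edit testbox
  #   change:  <target dev='vda' bus='virtio'/>
  #   to:      <target dev='hda' bus='ide'/>
  virsh shutdown testbox
  # once the guest has powered off:
  virsh start testbox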

Comment 11 John Nebel 2012-03-13 16:43:26 UTC
Similar problem here with both 5.7 and 5.8, but the guest hang could be immediate or take a few days. Dropping back to kvm 83-239 fixed it for me. I have an open support case, and there is a fixed test version of 83-249.


John Nebel

Comment 12 Tom Georgoulias 2012-03-13 18:10:38 UTC
Experiencing the same issue (VM is unresponsive after 3-10 hours) and am using
the same workaround (changed disk from vda/virtio to hda/ide).  Both host and
guest are running 5.8 with 2.6.18-308.1.1.el5 kernel.

Command for guest:

/usr/libexec/qemu-kvm -S -M rhel5.4.0 -m 8192 \
    -smp 4,sockets=4,cores=1,threads=1 -name radm002p \
    -uuid fe57b3f0-74be-ee69-26fa-a6e5c14d8e24 \
    -monitor unix:/var/lib/libvirt/qemu/radm002p.monitor,server,nowait \
    -no-kvm-pit-reinjection -boot c \
    -drive file=/dev/vg0/radm002p,if=ide,bus=0,unit=0,boot=on,format=raw,cache=none \
    -net nic,macaddr=54:52:00:43:43:06,vlan=0,model=virtio -net tap,fd=19,vlan=0 \
    -net nic,macaddr=54:52:00:14:b7:6f,vlan=1,model=virtio -net tap,fd=20,vlan=1 \
    -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us -vga cirrus -balloon virtio

Comment 13 Miya Chen 2012-03-14 04:48:13 UTC
Hi Johnny,
could you please capture a core file with the following steps? Thanks.
1. While the bug is reproducing, attach gdb to the qemu-kvm process and let it continue:
   gdb -p <pid>
   (gdb) c
2. From another shell, send SIGABRT so gdb stops the process at the hang:
   kill -ABRT <pid>
3. Generate the core file from within gdb:
   (gdb) generate-core-file
Then please attach the resulting core file here.
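
If an interactive session is awkward, a non-interactive variant of the same capture (a sketch; the pgrep-based PID lookup is an assumption, and attaching dumps the process at its current state rather than at a SIGABRT):

  # locate the qemu-kvm process and dump a core without an interactive gdb session
  pid=$(pgrep -f qemu-kvm | head -1)
  gdb -p "$pid" -batch -ex 'generate-core-file /tmp/qemu-kvm.core'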

Comment 14 Dor Laor 2012-03-14 07:41:52 UTC
We believe the current issue is a duplicate of bug #782631.
Please verify the fix that resides at http://people.redhat.com/myamazak/.kvm-83-249.affinity_fix.el5_8/
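
A sketch of applying the test build (assumes the directory above is browsable; substitute the actual file names it contains):

  # fetch the test packages and freshen the installed ones
  wget -r -np -nd http://people.redhat.com/myamazak/.kvm-83-249.affinity_fix.el5_8/
  rpm -Fvh kvm*.rpm kmod-kvm*.rpm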

Comment 15 Tom Georgoulias 2012-03-19 15:47:47 UTC
I installed the kvm-83-249.affinity_fix.el5_8 packages, switched back to the vda/virtio driver, and my VMs have been running for 4 days.

Comment 16 Ivan Pablo Anauati 2012-03-19 19:44:54 UTC
Same here. No hangs reported over the weekend with the affinity_fix package. Is this indeed a duplicate of 782631? Can we get an ETA?

Thank you

Comment 17 Johnny Hughes 2012-03-19 23:42:07 UTC
I installed the affinity_fix package today and shifted my test VM's drive from IDE back to virtio. Before this, I had 8 days with no issues after shifting from virtio to IDE.

I will post if I hit the issue again (or after 3 days, as it never took that long with the non-affinity packages).

Comment 18 Dor Laor 2012-03-20 07:46:21 UTC
(In reply to comment #16)
> Same here. No hangs reported over the weekend with the affinity_fix package.
> Is this indeed a duplicate of 782631? Can we get an ETA?
> 
> Thank you

Thanks for the report and the confirmation that the patch solves it.
I'll mark this bug as a duplicate of bug #782631. We have already composed a Z-stream rpm for it, kvm-83-249.el5_8, and it is in ON_QA state.

*** This bug has been marked as a duplicate of bug 782631 ***

Comment 19 Tom Diehl 2012-03-20 21:13:04 UTC
Any chance you can allow access to 782631 so we can follow it? Referring us to a private bug does not help those without access.

Thanks

Comment 20 Johnny Hughes 2012-03-21 13:31:49 UTC
The SRPM for this is on the public FTP site, but there does not seem to be an announcement on the Errata page for Red Hat Enterprise Linux (v. 5 *) yet (where * is Server, Client, etc.).

Comment 21 Johnny Hughes 2012-03-21 13:54:52 UTC
This is fixed by the RPMs in this announcement:

https://rhn.redhat.com/errata/RHBA-2012-0398.html
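
For hosts subscribed to RHN, the errata packages can be pulled in the usual way (a sketch), after which the affected guests need to be restarted under the fixed kvm:

  yum update kvm kmod-kvm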

