Description of problem:
IT#614033

KVM host with several Linux guests. Some guests use local storage, some guests use storage on NFS. The NFS filer ran out of space, causing all NFS-based guests to become unresponsive. Space was freed up, but the guests are still unresponsive. virsh list shows all guests in state "running". VNC can connect to the local-storage guests and see the display; VNC connects to the NFS-based guests but shows only a black screen. The local-storage guests respond to ping, the NFS guests do not. 'virsh shutdown' has no effect on the NFS-based guests.

Version-Release number of selected component (if applicable):
RHEL5.5 x86_64 host, KVM guests running RHEL5.4/5 with sparse disks on a NetApp NFS filer

How reproducible:
Always

Steps to Reproduce:
1. Set up a few VMs with virt-manager, using sparse disks placed on an NFS export.
2. Assign more disk space than is available on the NFS side.
3. Try to fill the space and make it run out.

Actual results:
The VMs go to paused state due to a QEMU I/O error. virsh is not aware of that, so the VMs have to be restarted instead of simply unpaused once there is enough space for them to continue.

Expected results:
Monitor the VMs for disk errors, suspend them as required, and allow them to be unpaused once the issue is resolved (see the sketch below).

Additional info:
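For illustration only, a minimal sketch of the expected behaviour using the libvirt-python event API. This assumes a libvirt-python new enough to provide virEventRegisterDefaultImpl()/virEventRunDefaultImpl() (the event-test.py shipped with libvirt-python-0.8.2 carries its own pure-Python event loop instead); the callback name is hypothetical:

import libvirt

def io_error_cb(conn, dom, src_path, dev_alias, action, opaque):
    # QEMU hit a disk I/O error (e.g. ENOSPC on the NFS export) and
    # suspended the guest; with event support libvirt learns about it here.
    print("I/O error on %s: %s (%s), guest suspended" %
          (dom.name(), src_path, dev_alias))

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR,
                            io_error_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()   # dispatches the callback above

Once space has been freed on the filer, the suspended guest only needs 'virsh resume <name>' (or domain.resume() through the API) rather than a full restart.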
libvirt does not set any disk error policy when launching QEMU, so it should be using the default policy, which is to report errors to the guest. The guest should not be pausing at all. Has the KVM default policy been changed somewhere to pause instead?

libvirt in RHEL5 cannot handle a scenario where the guest pauses, because it does not have any way to receive an event notification of this.

How did you verify that the guest really is paused, as opposed to the guest OS /appearing/ to be paused by virtue of the kernel being stuck handling disk I/O errors?
According to Gleb, the default is to stop on enospc in rhel5.5 and upstream. Assuming we stopped, why does virsh list show the VMs as running?
> According to Gleb, the default is to stop on enospc in rhel5.5 and upstream.

Current upstream is not relevant to this discussion. The RHEL5 behaviour is what's important, and this is a deviation from upstream behaviour at the time of this version of QEMU.

> Assuming we stopped, why does virsh list show the VMs as running?

This is because libvirt has no way of knowing that QEMU stopped. The RHEL5-vintage QEMU had no event notification mechanism upstream; the events patches are a custom RHEL addition for VDSM, which libvirt does not support.
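To make that concrete, a minimal sketch of what a management script polling libvirt sees on the affected builds. The assumption, consistent with the behaviour described in this bug, is that libvirt only updates its cached domain state from QEMU events, so without them dom.info() keeps returning the stale "running" state; the domain name is hypothetical:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("test1")   # hypothetical NFS-backed guest

# dom.info() returns (state, maxMem, memory, nrVirtCpu, cpuTime).
# Without an event channel from QEMU this is libvirt's cached state, so it
# can still say "running" after QEMU stopped the vCPUs on ENOSPC.
if dom.info()[0] == libvirt.VIR_DOMAIN_PAUSED:
    print("libvirt reports %s as paused" % dom.name())
else:
    print("libvirt reports %s as running (possibly stale)" % dom.name())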
Fixed in libvirt-0.8.2-1.el5
This one isn't actually fixed, because RHEL5 QEMU doesn't support QMP / events. We would need to wire up the text monitor events to make this work.
Ah, I got confused by the "Upstream081" label.
Fixed in libvirt-0.8.2-12.el5
Verified as PASSED in the following environment:
RHEL5.6-Server-x86_64-KVM
kernel-2.6.18-232.el5
kvm-qemu-img-83-207.el5
libvirt-0.8.2-12.el5

Detailed steps:

1. Create the NFS storage and check its size:
# mount -t nfs 10.66.93.186:/var/lib/libvirt/images/ /var/lib/libvirt/migrate
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   16G   12G  57% /var/lib/libvirt/migrate

2. Create 2 guests (test1, test2) on the NFS storage and 1 guest (rhel55) on local storage using virt-manager. Make sure the total size of test1 and test2 is larger than the available space, e.g. test1: 6G, test2: 10G, and do not allocate the entire virtual disk for these 2 guests.

3. On the host, also create a file on the NFS storage so that space can be released later:
# dd if=/dev/zero of=/var/lib/libvirt/migrate/data.img bs=1024 count=1024000

4. After the guests have all finished installing, on the host:
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   22G  5.7G  80% /var/lib/libvirt/migrate
# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  7 test2                running
  8 test1                running

5. In guests test1 and test2, repeatedly write files like the following until the NFS storage is completely full:
# dd if=/dev/zero of=/tmp/write-test1 bs=1024 count=1024000
Check the NFS storage:
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   29G     0 100% /var/lib/libvirt/migrate

6. Now check the guest status:
# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  7 test2                running
  8 test1                paused
Also check whether the host can ping all the guests: guest test1 cannot be pinged successfully, the other guests can.
# python /usr/share/doc/libvirt-python-0.8.2/events-python/event-test.py qemu:///system
...
myDomainEventIOErrorCallback: Domain test1(8) /var/lib/libvirt/migrate/test1.img ide0-hd0 1
myDomainEventCallback1 EVENT: Domain test1(8) Suspended IOError
myDomainEventCallback2 EVENT: Domain test1(8) Suspended IOError

7. Release the space on the NFS storage from the host:
# rm -rf /var/lib/libvirt/migrate/data.img

8. Check whether guest test1 can resume properly:
# virsh resume test1
Domain test1 resumed
# python /usr/share/doc/libvirt-python-0.8.2/events-python/event-test.py qemu:///system
.....
myDomainEventCallback1 EVENT: Domain test1(8) Resumed Unpaused
myDomainEventCallback2 EVENT: Domain test1(8) Resumed Unpaused

Finally, use ping and other commands in guest test1 to make sure it has resumed successfully.

------------------

This bug can be reproduced with libvirt-0.8.2-10.el5: using 'virsh list --all', all the guests show state "running" the whole time, but the guests on NFS storage are actually paused and cannot be pinged from the host. The libvirt event handler also produces no output for the I/O error. The guest on local storage works fine all the time.
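For reference, the resume in step 8 can also be driven through libvirt-python instead of virsh; a minimal sketch using the same guest name as above, with error handling omitted:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("test1")

# Resume the guest that was suspended on the ENOSPC I/O error, then
# confirm that libvirt reports it as running again.
dom.resume()
if dom.info()[0] == libvirt.VIR_DOMAIN_RUNNING:
    print("test1 resumed successfully")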
*** Bug 536946 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html
*** Bug 536947 has been marked as a duplicate of this bug. ***