Description of problem:
IT#614033

KVM host with several Linux guests. Some guests use local storage, some guests use storage on NFS. The NFS filer ran out of space, causing all NFS-based guests to become unresponsive. Space was freed up, but the guests are still unresponsive. virsh list shows all guests in state "running". VNC can connect to the local-storage guests and see the display; VNC connects to the NFS-based guests but shows only a black screen. The local-storage guests respond to ping, the NFS guests do not. 'virsh shutdown' has no effect on the NFS-based guests.

Version-Release number of selected component (if applicable):
RHEL5.5 x86_64 host, KVM guests running RHEL5.4/5 with sparse disks on a NetApp NFS filer

How reproducible:
Always

Steps to Reproduce:
1. Set up a few VMs with virt-manager, using sparse disks placed on an NFS export.
2. Assign more disk space than is available on the NFS side.
3. Try to fill the space and make it run out.

Actual results:
The VMs go to paused state due to a QEMU I/O error. virsh is not aware of that, so the VMs have to be restarted instead of simply unpaused once there is enough space for them to continue.

Expected results:
Monitor the VMs for disk errors, suspend them as required, and allow them to be unpaused once the issue is resolved (see the sketch below).

Additional info:
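For illustration only, a minimal sketch of the expected behaviour using the libvirt-python event API. This assumes a libvirt-python new enough to provide virEventRegisterDefaultImpl()/virEventRunDefaultImpl() (the event-test.py shipped with libvirt-python-0.8.2 carries its own pure-Python event loop instead); the callback name is hypothetical:

import libvirt

def io_error_cb(conn, dom, src_path, dev_alias, action, opaque):
    # QEMU hit a disk I/O error (e.g. ENOSPC on the NFS export) and
    # suspended the guest; with event support libvirt learns about it here.
    print("I/O error on %s: %s (%s), guest suspended" %
          (dom.name(), src_path, dev_alias))

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR,
                            io_error_cb, None)

while True:
    libvirt.virEventRunDefaultImpl()   # dispatches the callback above

Once space has been freed on the filer, the suspended guest only needs 'virsh resume <name>' (or domain.resume() through the API) rather than a full restart.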
libvirt does not set any disk error policy when launching QEMU, so it should be using the default policy, which is to report errors to the guest. The guest should not be pausing at all. Has the KVM default policy been changed somewhere to pause instead?

libvirt in RHEL5 cannot handle a scenario where the guest pauses, because it does not have any way to receive an event notification of this.

How did you verify that the guest really is paused, as opposed to the guest OS /appearing/ to be paused by virtue of the kernel being stuck handling disk I/O errors?
According to Gleb, the default is to stop on enospc in rhel5.5 and upstream. Assuming we stopped, why does virsh list show the VMs as running?
> According to Gleb, the default is to stop on enospc in rhel5.5 and upstream.

Current upstream is not relevant to this discussion. The RHEL5 behaviour is what's important, and this is a deviation from upstream behaviour at the time of this version of QEMU.

> Assuming we stopped, why does virsh list show the VMs as running?

This is because libvirt has no way of knowing that QEMU stopped. The RHEL5-vintage QEMU had no event notification mechanism upstream; the events patches are a custom RHEL addition for VDSM, which libvirt does not support.
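To make that concrete, a minimal sketch of what a management script polling libvirt sees on the affected builds. The assumption, consistent with the behaviour described in this bug, is that libvirt only updates its cached domain state from QEMU events, so without them dom.info() keeps returning the stale "running" state; the domain name is hypothetical:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("test1")   # hypothetical NFS-backed guest

# dom.info() returns (state, maxMem, memory, nrVirtCpu, cpuTime).
# Without an event channel from QEMU this is libvirt's cached state, so it
# can still say "running" after QEMU stopped the vCPUs on ENOSPC.
if dom.info()[0] == libvirt.VIR_DOMAIN_PAUSED:
    print("libvirt reports %s as paused" % dom.name())
else:
    print("libvirt reports %s as running (possibly stale)" % dom.name())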
Fixed in libvirt-0.8.2-1.el5
This one isn't actually fixed, because RHEL5 QEMU doesn't support QMP / events. We would need to wire up the text monitor events to make this work.
Ah, I got confused by the "Upstream081" label.
Fixed in libvirt-0.8.2-12.el5
Verified as PASSED in the following environment:
RHEL5.6-Server-x86_64-KVM
kernel-2.6.18-232.el5
kvm-qemu-img-83-207.el5
libvirt-0.8.2-12.el5

Detailed steps:

1. Create the NFS storage and check its size:
# mount -t nfs 10.66.93.186:/var/lib/libvirt/images/ /var/lib/libvirt/migrate
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   16G   12G  57% /var/lib/libvirt/migrate

2. Create 2 guests (test1, test2) on the NFS storage and 1 guest (rhel55) on local storage using virt-manager. Make sure the total size of test1 and test2 is larger than the available space, e.g. test1: 6G, test2: 10G, and do not allocate the entire virtual disk for these 2 guests.

3. On the host, also create a file on the NFS storage so that space can be released later:
# dd if=/dev/zero of=/var/lib/libvirt/migrate/data.img bs=1024 count=1024000

4. After the guests have all finished installing, on the host:
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   22G  5.7G  80% /var/lib/libvirt/migrate
# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  7 test2                running
  8 test1                running

5. In guests test1 and test2, repeatedly write files like the following until the NFS storage is completely full:
# dd if=/dev/zero of=/tmp/write-test1 bs=1024 count=1024000
Check the NFS storage:
# df -h /var/lib/libvirt/migrate/
Filesystem                              Size  Used Avail Use% Mounted on
10.66.93.186:/var/lib/libvirt/images/    29G   29G     0 100% /var/lib/libvirt/migrate

6. Now check the guest status:
# virsh list --all
 Id Name                 State
----------------------------------
  3 rhel55               running
  7 test2                running
  8 test1                paused
Also check whether the host can ping all the guests: guest test1 cannot be pinged successfully, the other guests can.
# python /usr/share/doc/libvirt-python-0.8.2/events-python/event-test.py qemu:///system
...
myDomainEventIOErrorCallback: Domain test1(8) /var/lib/libvirt/migrate/test1.img ide0-hd0 1
myDomainEventCallback1 EVENT: Domain test1(8) Suspended IOError
myDomainEventCallback2 EVENT: Domain test1(8) Suspended IOError

7. Release the space on the NFS storage from the host:
# rm -rf /var/lib/libvirt/migrate/data.img

8. Check whether guest test1 can resume properly:
# virsh resume test1
Domain test1 resumed
# python /usr/share/doc/libvirt-python-0.8.2/events-python/event-test.py qemu:///system
.....
myDomainEventCallback1 EVENT: Domain test1(8) Resumed Unpaused
myDomainEventCallback2 EVENT: Domain test1(8) Resumed Unpaused

Finally, use ping and other commands in guest test1 to make sure it has resumed successfully.

------------------

This bug can be reproduced with libvirt-0.8.2-10.el5: using 'virsh list --all', all the guests show state "running" the whole time, but the guests on NFS storage are actually paused and cannot be pinged from the host. The libvirt event handler also produces no output for the I/O error. The guest on local storage works fine all the time.
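For reference, the resume in step 8 can also be driven through libvirt-python instead of virsh; a minimal sketch using the same guest name as above, with error handling omitted:

import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("test1")

# Resume the guest that was suspended on the ENOSPC I/O error, then
# confirm that libvirt reports it as running again.
dom.resume()
if dom.info()[0] == libvirt.VIR_DOMAIN_RUNNING:
    print("test1 resumed successfully")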
*** Bug 536946 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0060.html
*** Bug 536947 has been marked as a duplicate of this bug. ***