Created attachment 823400 [details]
logs

Description of problem:
Backup-API: When a VM has an attached disk snapshot from another VM, and that disk is no longer accessible to the VM because connectivity to the storage domain is blocked, the VM does not enter the 'paused' state; it is still reported as 'up'.

Version-Release number of selected component (if applicable):
3.3 - is22
vdsm-4.13.0-0.7.beta1.el6ev.x86_64
rhevm-3.3.0-0.32.beta1.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block/file pool with 1 host:
1. Create a VM with a disk and take a snapshot of the VM.
2. Create another VM and attach the disk snapshot from the first VM to it (using REST; see the sketch after this comment).
3. Block connectivity from the SPM to the storage server on which the first VM's disk is located (also sketched below).

Actual results:
The VM state does not change to 'paused', even though vdsm reports that it cannot get the VM disk status:

GuestMonitor-bs-nfs::DEBUG::2013-11-13 15:08:02,572::vm::643::vm.Vm::(_getDiskStats) vmId=`8c773b09-c28c-414a-b99f-10d0aa956d58`::Disk vda stats not available

The VM is reported as 'up' by VDSM:

[root@nott-vds1 transient]# vdsClient -s 0 list table
8c773b09-c28c-414a-b99f-10d0aa956d58  19335  bs-nfs  Up

Expected results:
After around 10 minutes, the VM state should change to 'paused' because it cannot see its disk.

Additional info:
logs
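For step 2, the attach call looks roughly like this (a sketch based on the Backup-Restore API flow; all UUIDs are placeholders):

    POST /api/vms/<backup_vm_id>/disks
    Content-Type: application/xml

    <disk id="<disk_id>">
        <snapshot id="<snapshot_id>"/>
        <active>true</active>
    </disk>

For step 3, one simple way to block connectivity on the SPM host (a sketch, assuming iptables is available; <storage_server_ip> is a placeholder):

    # block outgoing traffic to the storage server
    iptables -A OUTPUT -d <storage_server_ip> -j DROP
    # later, delete the rule to restore connectivity
    iptables -D OUTPUT -d <storage_server_ip> -j DROP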
Created attachment 823468 [details] logs2
Created attachment 823470 [details]
logs3

After more than 1 hour, the VM entered the 'paused' state. Updated logs and a screenshot are attached (logs3).
This is highly dependent on the I/O workload in the guest. If the guest is not reading any offset that has to come from the storage domain (as opposed to the local transient layer), then the guest will never pause.

"Disk vda stats not available" is perfectly normal (in any situation); it just means that we haven't collected enough (or the relevant) data to provide the statistics. I can look more into these messages, but they seem unrelated to blocking the storage connectivity (in fact, they persist even after you unblock it).
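To force reads that must be served from the storage domain rather than the transient layer, something like the following inside the guest should work (a sketch; 'vda' is taken from the log above, and iflag=direct bypasses the guest page cache so reads actually reach the blocked backing storage):

    dd if=/dev/vda of=/dev/null bs=1M iflag=direct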
1. vdsm does not pause VMs; qemu does.
2. A guest will only pause if the storage layer returned EIO, which would only happen after I/O to the problematic device has been dispatched, and even then it takes a while until the storage layer gives up.

So either there is no bug here, or it should be moved to qemu (although you need to show that some I/O was actually dispatched to the problematic device).
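One way to check whether I/O was actually dispatched is to sample the block device counters on the host before and after the connectivity block (a sketch; the domain name is a placeholder, and 'vda' is the device from the log):

    virsh domblkstat <domain> vda

If the rd_req/wr_req counters keep rising after the block, I/O is being issued to the device and an EIO (and pause) should eventually follow; if they are flat, the guest simply isn't touching the disk.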
There is nothing for vdsm to do here. If this reproduces, feel free to open a bug on qemu to check whether there is a problem there (although in general it is likely that your VM simply did not try to access the disk and therefore never hit an EIO).