Bug 1029886

Summary: [vdsm] [backup-api] It takes a very long time (more than 1 hour) for a VM with a transient volume to enter the 'paused' state when it cannot see its attached disk

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Reporter: Elad <ebenahar>
Assignee: Federico Simoncelli <fsimonce>
QA Contact: Aharon Canan <acanan>
Status: CLOSED NOTABUG
Severity: high
Priority: unspecified
Version: 3.3.0
Target Milestone: ---
Target Release: 3.4.0
Keywords: Triaged
CC: abaron, amureini, bazulay, iheim, lpeer, scohen, yeylon
Hardware: x86_64
OS: Unspecified
Whiteboard: storage
oVirt Team: Storage
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-02-16 09:26:35 UTC

Attachments:
- logs (no flags)
- logs2 (no flags)
- logs3 (no flags)

Description Elad 2013-11-13 13:16:02 UTC
Created attachment 823400 [details]
logs

Description of problem:
Backup-API:
Backup-API: when a VM has an attached disk snapshot taken from another VM, and that disk is no longer accessible to the VM because connectivity to the storage domain is blocked, the VM does not enter the 'paused' state; it is still reported as 'up'.

Version-Release number of selected component (if applicable):
3.3 - is22
vdsm-4.13.0-0.7.beta1.el6ev.x86_64
rhevm-3.3.0-0.32.beta1.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block/file pool with 1 host:
1. Create a VM with a disk and take a snapshot of the VM.
2. Create another VM and attach the disk snapshot from the first VM to it (using REST).
3. Block connectivity from the SPM to the storage server on which the first VM's disk resides.
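For step 3, connectivity is typically blocked on the SPM host with an iptables DROP rule. A minimal sketch (the storage server address is an assumption, not taken from the attached logs):

```shell
# Hypothetical storage server address -- substitute the address of the
# storage server that holds the first VM's disk.
STORAGE_IP=10.35.0.99

# Commands to run as root on the SPM host: the first drops all outbound
# traffic to the storage server, the second removes the rule afterwards.
BLOCK_CMD="iptables -A OUTPUT -d ${STORAGE_IP} -j DROP"
UNBLOCK_CMD="iptables -D OUTPUT -d ${STORAGE_IP} -j DROP"
echo "block:   ${BLOCK_CMD}"
echo "restore: ${UNBLOCK_CMD}"
```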

Actual results:
The VM state does not change to 'paused', even though vdsm reports that it cannot get the VM's disk stats:

GuestMonitor-bs-nfs::DEBUG::2013-11-13 15:08:02,572::vm::643::vm.Vm::(_getDiskStats) vmId=`8c773b09-c28c-414a-b99f-10d0aa956d58`::Disk vda stats not available

VM is reported as 'up' by VDSM:

[root@nott-vds1 transient]# vdsClient -s 0 list table
8c773b09-c28c-414a-b99f-10d0aa956d58  19335  bs-nfs               Up

Expected results:
After around 10 minutes, the VM should move to the 'paused' state because it cannot see its disk.

Additional info: logs

Comment 1 Elad 2013-11-13 14:24:29 UTC
Created attachment 823468 [details]
logs2

Comment 2 Elad 2013-11-13 14:31:03 UTC
Created attachment 823470 [details]
logs3

After more than 1 hour, the VM entered the 'paused' state.


Updated logs and screenshot attached. (logs3)

Comment 3 Federico Simoncelli 2013-11-15 17:55:06 UTC
This is highly dependent on the I/O workload in the guest. If the guest never reads an offset that must be served from the storage domain (as opposed to the local transient layer), then the guest will never pause.

"Disk vda stats not available" are perfectly normal (in any situation), it just means that we haven't collected enough (or the relevant) data to provide the statistics.

I can look more into these messages, but they seem unrelated to blocking the storage connectivity (in fact, they persist even after you unblock it).
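The transient layer mentioned above can be pictured as a local qcow2 overlay backed by the shared snapshot volume: guest writes land in the overlay, and only reads of blocks not yet written locally reach the (blocked) storage domain. A sketch of that layering with `qemu-img` (the paths are assumptions for illustration, not the actual vdsm paths from this bug):

```shell
# Assumed paths, for illustration only.
BACKING=/rhev/data-center/pool/domain/images/img/snapshot-volume
OVERLAY=/var/lib/vdsm/transient/overlay.qcow2

# A qcow2 overlay whose backing file is the shared snapshot volume;
# unwritten blocks fall through to BACKING, written blocks stay in OVERLAY.
CREATE_CMD="qemu-img create -f qcow2 -o backing_file=${BACKING} ${OVERLAY}"
echo "${CREATE_CMD}"
```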

Comment 4 Ayal Baron 2013-11-17 10:31:45 UTC
1. vdsm does not pause VMs; qemu does.
2. A guest will only pause if the storage layer returns EIO, which happens only after I/O to the problematic device has actually been dispatched, and even then it takes a while until the storage layer gives up.

So either there is no bug here, or it should be moved to qemu (although you would need to show that some I/O was actually dispatched to the problematic device).
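The pause-on-error behavior described here is controlled by the disk's error policy in the libvirt domain XML: with `error_policy='stop'`, qemu suspends the guest when an I/O error reaches the device. A minimal sketch (device path and image format are assumptions, not from this bug's configuration):

```xml
<disk type='block' device='disk'>
  <!-- error_policy='stop' pauses the guest when qemu hits an I/O error (EIO) -->
  <driver name='qemu' type='qcow2' error_policy='stop'/>
  <source dev='/dev/mapper/example-lv'/>
  <target dev='vda' bus='virtio'/>
</disk>
```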

Comment 5 Ayal Baron 2014-02-16 09:26:35 UTC
There is nothing for vdsm to do here. If this reproduces, feel free to open a bug on qemu to check whether there is a problem (although in general it is likely that your VM simply did not try to access the disk and never hit an EIO).