Bug 1029886 - [vdsm] [backup-api] It takes a very long time (more than 1 hour) for a VM with a transient volume to enter the 'paused' state when it cannot see its attached disk
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Federico Simoncelli
QA Contact: Aharon Canan
Whiteboard: storage
Keywords: Triaged
Depends On:
Blocks:
 
Reported: 2013-11-13 08:16 EST by Elad
Modified: 2016-02-10 13:47 EST
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-16 04:26:35 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs (6.65 MB, application/x-gzip) - 2013-11-13 08:16 EST, Elad
logs2 (5.89 MB, application/x-gzip) - 2013-11-13 09:24 EST, Elad
logs3 (6.67 MB, application/x-gzip) - 2013-11-13 09:31 EST, Elad

Description Elad 2013-11-13 08:16:02 EST
Created attachment 823400 [details]
logs

Description of problem:
Backup-API:
When we have a VM that has an attached disk snapshot from another VM, and the disk is no longer accessible to that VM because connectivity to the storage domain is blocked, the VM does not enter the 'paused' state; it is still reported as 'up'.

Version-Release number of selected component (if applicable):
3.3 - is22
vdsm-4.13.0-0.7.beta1.el6ev.x86_64
rhevm-3.3.0-0.32.beta1.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block/file pool with 1 host:                                                                
1. create a VM with a disk and take a snapshot of the VM
2. create another VM and attach the disk snapshot from the first VM to it using REST (see the example calls after these steps)
3. block connectivity from the SPM to the storage server on which the first VM's disk is located
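
A rough sketch of how steps 2 and 3 can be performed (not taken from the report itself; the UUIDs, credentials, hostname and address below are placeholders, and iptables is just one common way to block connectivity):

# Step 2: attach the first VM's disk snapshot to the second VM via the REST API
curl -k -u "admin@internal:PASSWORD" -H "Content-Type: application/xml" \
     -d '<disk id="DISK_UUID"><snapshot id="SNAPSHOT_UUID"/><active>true</active></disk>' \
     "https://RHEVM_FQDN/api/vms/SECOND_VM_UUID/disks"

# Step 3: on the SPM host, drop all outgoing traffic to the storage server
iptables -A OUTPUT -d STORAGE_SERVER_IP -j DROP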

Actual results:
The VM state does not change to 'paused' even though vdsm reports that it cannot get the VM's disk stats:

GuestMonitor-bs-nfs::DEBUG::2013-11-13 15:08:02,572::vm::643::vm.Vm::(_getDiskStats) vmId=`8c773b09-c28c-414a-b99f-10d0aa956d58`::Disk vda stats not available

The VM is reported as 'Up' by vdsm:

[root@nott-vds1 transient]# vdsClient -s 0 list table
8c773b09-c28c-414a-b99f-10d0aa956d58  19335  bs-nfs               Up

Expected results:
After around 10 minutes, the VM should move to the 'paused' state because it cannot see its disk.

Additional info: logs
Comment 1 Elad 2013-11-13 09:24:29 EST
Created attachment 823468 [details]
logs2
Comment 2 Elad 2013-11-13 09:31:03 EST
Created attachment 823470 [details]
logs3

After more than 1 hour, the VM entered the 'paused' state.


Updated logs and screenshot attached. (logs3)
Comment 3 Federico Simoncelli 2013-11-15 12:55:06 EST
This is highly dependent on the IO workload in the guest. If the guest is not reading any offset that has to come from the storage domain (as opposed to the local transient layer), then the guest will never pause.
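
For example, a full direct-I/O read of the attached disk from inside the guest forces reads that can only be served by the blocked storage domain rather than by the local transient layer (an illustrative command only; it assumes the attached snapshot disk appears as /dev/vdb in the guest):

dd if=/dev/vdb of=/dev/null bs=1M iflag=direct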

"Disk vda stats not available" are perfectly normal (in any situation), it just means that we haven't collected enough (or the relevant) data to provide the statistics.

I can look more into these messages, but they seem unrelated to blocking the storage connectivity (in fact, they persist even after you unblock it).
Comment 4 Ayal Baron 2013-11-17 05:31:45 EST
1. vdsm does not pause VMs; qemu does.
2. a guest will only pause if the storage layer returns EIO, which happens only after I/O to the problematic device has actually been sent, and even then it takes a while until the storage layer gives up.

So either there is no bug here, or it should be moved to qemu (although you need to show that some I/O was actually dispatched to the problematic device).
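
One way to check that from the host (a suggestion, not part of the original report; "bs-nfs" is the VM name from the vdsClient output above) is to ask libvirt for the pause reason and for block devices in an error state:

virsh domstate bs-nfs --reason   # reports the paused reason (e.g. an I/O error) once qemu stops the guest
virsh domblkerror bs-nfs         # lists block devices currently in an error state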
Comment 5 Ayal Baron 2014-02-16 04:26:35 EST
There is nothing for vdsm to do here. If this reproduces, feel free to open a bug on qemu to check whether there is a problem (although in general it is likely that your VM simply did not try to access the disk and therefore did not hit an EIO).
