Bug 1029886 - [vdsm] [backup-api] It takes a very long time (more than 1 hour) for a VM with a transient volume to get into 'paused' state when it cannot see its attached disk
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Federico Simoncelli
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
 
Reported: 2013-11-13 13:16 UTC by Elad
Modified: 2016-02-10 18:47 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-16 09:26:35 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
logs (6.65 MB, application/x-gzip), 2013-11-13 13:16 UTC, Elad
logs2 (5.89 MB, application/x-gzip), 2013-11-13 14:24 UTC, Elad
logs3 (6.67 MB, application/x-gzip), 2013-11-13 14:31 UTC, Elad

Description Elad 2013-11-13 13:16:02 UTC
Created attachment 823400 [details]
logs

Description of problem:
Backup-API:
When a VM has an attached disk snapshot taken from another VM, and that disk is no longer accessible because connectivity to its storage domain is blocked, the VM does not enter the 'paused' state; it is still reported as 'up'.

Version-Release number of selected component (if applicable):
3.3 - is22
vdsm-4.13.0-0.7.beta1.el6ev.x86_64
rhevm-3.3.0-0.32.beta1.el6ev.noarch
libvirt-0.10.2-29.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
On a block/file pool with 1 host:
1. Create a VM with a disk and create a snapshot of the VM.
2. Create another VM and attach the disk snapshot from the first VM to it (using REST; see the sketch after these steps).
3. Block connectivity from the SPM to the storage server on which the first VM's disk is located.
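
A rough command-line sketch of steps 2 and 3 (the engine address, credentials and UUIDs are placeholders, not values from these logs; the REST call follows the backup API pattern of POSTing a disk element that references a snapshot):

# Step 2: attach the first VM's disk snapshot to the backup VM via REST
curl -k -u admin@internal:PASSWORD \
     -H "Content-Type: application/xml" \
     -d '<disk id="DISK_UUID"><snapshot id="SNAPSHOT_UUID"/></disk>' \
     https://rhevm.example.com/api/vms/BACKUP_VM_UUID/disks

# Step 3: on the SPM host, drop traffic to the storage server so the
# first VM's storage domain becomes unreachable
iptables -I OUTPUT -d STORAGE_SERVER_IP -j DROP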

Actual results:
VM state does not change to 'paused' even though vdsm reports that it cannot get the VM disk status:

GuestMonitor-bs-nfs::DEBUG::2013-11-13 15:08:02,572::vm::643::vm.Vm::(_getDiskStats) vmId=`8c773b09-c28c-414a-b99f-10d0aa956d58`::Disk vda stats not available

VM is reported as 'up' by VDSM:

[root@nott-vds1 transient]# vdsClient -s 0 list table
8c773b09-c28c-414a-b99f-10d0aa956d58  19335  bs-nfs               Up

Expected results:
After around 10 minutes, the VM state should move to 'paused' because it cannot see its disk.

Additional info: logs

Comment 1 Elad 2013-11-13 14:24:29 UTC
Created attachment 823468 [details]
logs2

Comment 2 Elad 2013-11-13 14:31:03 UTC
Created attachment 823470 [details]
logs3

After more than 1 hour, the VM entered the 'paused' state.


Updated logs and screenshot attached. (logs3)

Comment 3 Federico Simoncelli 2013-11-15 17:55:06 UTC
This is highly dependent on the I/O workload in the guest. If the guest is not reading any offset that has to come from the storage domain (as opposed to the local transient layer), then the guest will never pause.

"Disk vda stats not available" are perfectly normal (in any situation), it just means that we haven't collected enough (or the relevant) data to provide the statistics.

I can look more into these messages, but they seem unrelated to blocking the storage connectivity (in fact they persist even after you unblock it).
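
For reference, the split described above can be seen on the host: the transient disk is a local qcow2 overlay whose backing file is the volume on the (now blocked) storage domain, so only reads that miss the local overlay ever reach the blocked storage. A sketch, using a placeholder path for the overlay in the transient directory shown in the prompt above:

# "backing file" in the output points at the storage-domain volume;
# anything already present in the local overlay is served locally
qemu-img info /path/to/vdsm/transient/TRANSIENT_OVERLAY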

Comment 4 Ayal Baron 2013-11-17 10:31:45 UTC
1. vdsm does not pause VMs; qemu does.
2. A guest will only pause if the storage layer returns EIO, which happens only after I/O has actually been sent to the problematic device, and even then it takes a while until the storage layer gives up.

So either there is no bug here, or it should be moved to qemu (although you would need to show that some I/O was actually dispatched to the problematic device).
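
Two checks that can help show whether the qemu side is even exercised (a sketch; the VM name and guest device are placeholders, and an error policy of 'stop' is the usual condition for qemu pausing a guest on an I/O error):

# Confirm the attached disk's error policy in the libvirt domain XML
virsh dumpxml BACKUP_VM_NAME | grep -A2 '<driver'

# Inside the guest: force direct I/O against the attached disk so that
# an EIO from the blocked storage can actually surface
dd if=/dev/vdb of=/dev/null bs=1M count=100 iflag=direct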

Comment 5 Ayal Baron 2014-02-16 09:26:35 UTC
There is nothing for vdsm to do here. If this reproduces, feel free to open a bug on qemu to check whether there is a problem (although in general it is likely that your VM simply did not try to access the disk and did not hit an EIO).
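
If anyone does reproduce this, the following (VM name is a placeholder) shows whether the guest actually hit an I/O error before moving the bug to qemu:

# Reports e.g. "paused (I/O error)" if qemu stopped the guest on EIO
virsh domstate BACKUP_VM_NAME --reason

# Lists any block devices currently in an I/O error state
virsh domblkerror BACKUP_VM_NAME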

