+++ This bug was initially created as a clone of Bug #1063336 +++

Description of problem:

Environment:
- Setup with 2 vms (1 disk each), 2 storage domains.
- Each vm has a disk on a different SD.
- First vm with a snapshot.
- Second vm has the first vm's snapshot disk attached (through the backup API).

When blocking the connection from the host the vms are running on to the storage domain that contains the first vm's disk (i.e. the vm with the snapshot), the second vm should be paused due to EIO. The problem is that after unblocking the connection, trying to run the paused vm fails with an error.

* In the vdsm logs, trying to resume the paused vm also fails with an error:
-->

* Powering off the paused vm works, but then trying to start it fails with the following error:
'Changed state to Down: Requested operation is not valid: domain 'vm_0' is already active'.

Output **before** shutting down the paused vm:

# vdsClient -s 0 list
ba8a2610-aa49-4505-ab12-ca6b9ac724a3  19207  vm_0  Paused

* After powering off the paused vm, libvirt still shows *both* vms as running:

# virsh -r list
 Id    Name                           State
----------------------------------------------------
 27    vm_0                           running
 29    vm_1                           running

However, vdsm reports that there are no running vms, and so does the engine/UI:

# vdsClient -s 0 list
None

Restarting vdsm on the host solves the problem (vdsm kills both qemu processes after recovering from the restart).

Version-Release number of selected component (if applicable):
vdsm-4.14.1-2.el6.x86_64
libvirt-0.10.2-29.el6.1.x86_64
ovirt-engine-3.4.0-0.5.beta1.el6.noarch

How reproducible:
100%

Steps to Reproduce:

Environment:
- Setup with 2 vms (1 disk each), 2 storage domains (on the same storage server).
- Each vm has a disk on a different SD.
- First vm with a snapshot.
- Second vm has the first vm's snapshot disk attached (through the backup API).

1. Run both vms on the same host
2. Block the connection from the host to the storage domain containing the first vm's disk
3. Both vms should enter the Paused state
4. Unblock the connection
5. Try to resume the vms, or power them off and rerun them

Actual results:
Step 5 fails as described above

Expected results:
The vms should recover from the Paused state once the connection is unblocked

Additional info:

--- Additional comment from Francesco Romani on 2014-02-10 12:33:07 EST ---

Hi, are the attached logs the complete ones?
Could you please provide some timing information for the steps you executed:
- at which time was the connection unblocked?
- at which time did you try to resume the VMs?

--- Additional comment from on 2014-02-12 04:57:22 EST ---

--- Additional comment from on 2014-02-12 04:57:44 EST ---

Hi, I attached new clean logs:
- connection was blocked at: 11:29:39
- VMs were paused at: 11:35:45
- connection was unblocked at: 11:36
- resumed the VMs at: 11:37:56
- powered off the VMs at: 11:38:28
- reran the VMs at: 11:39:41

--- Additional comment from Francesco Romani on 2014-02-12 05:23:42 EST ---

Thanks, the information you provided was very helpful and let me find the problem quickly.

The root cause of the issue is the following:

* If, in the recovery path, VDSM is unable to access any of the volumes, it raises an exception (see [1] below).
* The exception is handled, but the lookup of the domain in libvirt is skipped (it should happen just after the buildConfDevices step).
* As a result, a Vm object is wrongly registered inside VDSM, but without the established connection to libvirt, which is represented by its _dom attribute.
* This in turn causes the failure of the vmCont operation (see [2] below) and the misalignment of state between VDSM and libvirt.

Log excerpts demonstrating the above:

[1]

Thread-13::ERROR::2014-02-12 11:32:34,718::vm::1991::vm.Vm::(_normalizeVdsmImg) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Unable to get volume size for ede4f955-549c-4f97-b103-127ee2b3672d
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1987, in _normalizeVdsmImg
    drv['truesize'] = res['truesize']
KeyError: 'truesize'
Thread-13::INFO::2014-02-12 11:32:34,719::vm::2244::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2233, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3093, in _run
    devices = self.buildConfDevices()
  File "/usr/share/vdsm/vm.py", line 2061, in buildConfDevices
    self._normalizeVdsmImg(drv)
  File "/usr/share/vdsm/vm.py", line 1993, in _normalizeVdsmImg
    drv['volumeID'])
RuntimeError: Volume ede4f955-549c-4f97-b103-127ee2b3672d is corrupted or missing
Thread-13::DEBUG::2014-02-12 11:32:34,720::vm::2247::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::_ongoingCreations released

[2]

Thread-166::DEBUG::2014-02-12 11:37:56,327::BindingXMLRPC::970::vds::(wrapper) client [10.35.161.54]::call vmCont with ('d200dff9-5656-402f-8cc4-9e1a5c68e938',) {} flowID [28e177e]
Thread-166::ERROR::2014-02-12 11:37:56,327::BindingXMLRPC::989::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 973, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 209, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 154, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 2587, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/vm.py", line 3726, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'

--- Additional comment from Francesco Romani on 2014-02-12 06:13:17 EST ---

It is worth noting that VDSM was restarted while the storage was blocked/unavailable (following the times you provided), which then triggered the recovery and everything that followed:

MainThread::DEBUG::2014-02-12 11:30:45,946::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-02-12 11:30:45,946::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::INFO::2014-02-12 11:30:46,898::vmChannels::183::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2014-02-12 11:30:46,898::momIF::84::MOM::(stop) Shutting down MOM
MainThread::INFO::2014-02-12 11:30:46,899::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2014-02-12 11:30:46,901::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/01a745ee-26ee-4e73-a99a-05e791e9a5eb/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,902::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/74ee9bdc-d6cd-49af-bbd4-4e8e2a009a25/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,947::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-02-12 11:30:46,947::clientIF::207::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2014-02-12 11:30:47,294::domainMonitor::125::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2014-02-12 11:30:47,295::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
Thread-36::DEBUG::2014-02-12 11:30:47,295::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
VM Channels Listener::INFO::2014-02-12 11:30:47,595::vmChannels::178::vds::(run) VM channels listener thread has ended.
Thread-37::DEBUG::2014-02-12 11:30:53,137::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 0639c74b-0a19-47c4-919a-105902edefbc
MainThread::INFO::2014-02-12 11:31:27,034::vdsm::119::vds::(run) (PID: 10360) I am the actual vdsm 4.14.1-2.el6 aqua-vds5.qa.lab.tlv.redhat.com (2.6.32-431.1.2.el6.x86_64)

Lowering the priority

--- Additional comment from Francesco Romani on 2014-02-13 03:41:39 EST ---

draft patch posted on gerrit
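[Editorial note] To make the failure mode in the analysis above easier to follow, here is a minimal, self-contained Python sketch of the described control flow. It is not the actual vdsm code; the class and method names only mirror the ones that appear in the tracebacks in [1] and [2]:

# Simplified sketch of the recovery flow described above (not vdsm code).
class FakeVm(object):
    def __init__(self, recovery=False):
        self.recovery = recovery
        self._dom = None              # libvirt domain handle, set by the lookup step

    def buildConfDevices(self):
        # Storage is unreachable, so normalizing the drive config fails.
        raise RuntimeError("Volume ... is corrupted or missing")

    def _lookupDomain(self):
        # Stands in for the libvirt lookup that should populate self._dom.
        self._dom = object()

    def _startUnderlyingVm(self):
        try:
            self.buildConfDevices()
            self._lookupDomain()      # never reached when the line above raises
        except Exception:
            if not self.recovery:
                raise
            # "Skipping errors on recovery": the VM stays registered,
            # but self._dom is still None.

    def cont(self):
        # Mirrors _underlyingCont(): dereferences self._dom unconditionally.
        return self._dom.XMLDesc(0)

vm = FakeVm(recovery=True)
vm._startUnderlyingVm()               # exception swallowed, _dom never set
try:
    vm.cont()
except AttributeError as exc:
    print("vmCont failed: %s" % exc)  # 'NoneType' object has no attribute 'XMLDesc'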
backport of the fix to branch 3.3
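[Editorial note] The content of the gerrit patch is not shown here. Purely as a hypothetical illustration of the kind of guard that would address the symptom in excerpt [2] (an assumption, not necessarily what the actual fix does), vmCont could refuse to operate on a VM whose libvirt connection was never established instead of dereferencing a None _dom:

# Hypothetical guard, for illustration only -- not the actual gerrit patch.
def cont(vm):
    if vm._dom is None:
        # Return a proper error to the caller instead of crashing with an
        # AttributeError deep inside _underlyingCont().
        return {'status': {'code': 1,
                           'message': 'VM has no established libvirt domain'}}
    return vm._underlyingCont()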
This is an automated message. Moving to Closed CURRENTRELEASE since oVirt 3.3.4 has been released.