Bug 1063336 - vdsm not reporting a VM that is reported running by libvirt
Summary: vdsm not reporting a VM that is reported running by libvirt
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: oVirt
Classification: Retired
Component: vdsm
Version: 3.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Francesco Romani
QA Contact: Raz Tamir
URL:
Whiteboard: virt
Depends On:
Blocks: 1069095
 
Reported: 2014-02-10 14:19 UTC by Raz Tamir
Modified: 2014-03-31 12:24 UTC
CC List: 10 users

Fixed In Version: vdsm-4.14.5
Clone Of:
: 1069095 (view as bug list)
Environment:
Last Closed: 2014-03-31 12:24:37 UTC
oVirt Team: ---
Embargoed:


Attachments
engine/vdsm logs (569.07 KB, application/x-gzip), 2014-02-10 14:19 UTC, Raz Tamir
new engine/vdsm logs (59.49 KB, application/x-gzip), 2014-02-12 09:57 UTC, Raz Tamir


Links
System ID           Private  Priority  Status  Summary                                          Last Updated
oVirt gerrit 24423  0        None      None    None                                             Never
oVirt gerrit 24821  0        None      None    None                                             Never
oVirt gerrit 24822  0        None      MERGED  vm: recover must finish with libvirt connection  Never

Description Raz Tamir 2014-02-10 14:19:53 UTC
Created attachment 861402 [details]
engine/vdsm logs

Description of problem:
Environment:
- Setup with 2 VMs (1 disk each) and 2 storage domains.
- Each VM has a disk on a different SD.
- First VM has a snapshot.
- Second VM has the first VM's snapshot disk attached (through the backup API).

When blocking the connection from the host running the VMs to the storage domain that contains the first VM's disk (i.e. the VM with the snapshot), the second VM is expected to pause due to EIO.
The problem is that after unblocking the connection, trying to run the paused VM fails with an error.
* In the vdsm logs, trying to resume the paused VM also fails with an error:
-->

* Powering off the paused VM works, but then trying to start it fails with the following error:
'Changed state to Down: Requested operation is not valid: domain 'vm_0' is already active'.

Output **before** powering off the paused VM:
# vdsClient -s 0 list
ba8a2610-aa49-4505-ab12-ca6b9ac724a3  19207  vm_0                 Paused

* After powering off the paused VM, libvirt still shows *both* VMs as running:
# virsh -r list
 Id    Name                           State
----------------------------------------------------
 27    vm_0                           running
 29    vm_1                           running

However, vdsm reports that there are no running VMs, and so does the engine/UI:

# vdsClient -s 0 list
None

Restarting vdsm on the host solves the problem (vdsm kills both qemu processes after recovering from the restart).



Version-Release number of selected component (if applicable):
vdsm-4.14.1-2.el6.x86_64
libvirt-0.10.2-29.el6.1.x86_64
ovirt-engine-3.4.0-0.5.beta1.el6.noarch


How reproducible:
100%

Steps to Reproduce:
Environment:
- Setup with 2 VMs (1 disk each) and 2 storage domains (on the same storage server).
- Each VM has a disk on a different SD.
- First VM has a snapshot.
- Second VM has the first VM's snapshot disk attached (through the backup API).
1. Run both VMs on the same host
2. Block the connection from the host to the storage domain containing the first VM's disk
3. Both VMs should enter Paused state
4. Unblock the connection
5. Try to resume the VMs, or power them off and rerun them

Actual results:
Step 5 fails as described above.

Expected results:
The VMs should recover from the Paused state once the connection is unblocked.

Additional info:

Comment 1 Francesco Romani 2014-02-10 17:33:07 UTC
Hi, are the attached logs complete? Could you please provide some timing information for the steps you executed:
- At what time was the connection unblocked?
- At what time did you try to resume the VMs?

Comment 2 Raz Tamir 2014-02-12 09:57:22 UTC
Created attachment 862181 [details]
new engine/vdsm logs

Comment 3 Raz Tamir 2014-02-12 09:57:44 UTC
Hi, 
I attached new clean logs:
- connection was blocked at: 11:29:39
- VMs were paused at: 11:35:45
- connection was unblocked at: 11:36
- tried to resume the VMs at: 11:37:56
- powered off the VMs at: 11:38:28
- reran the VMs at: 11:39:41

Comment 4 Francesco Romani 2014-02-12 10:23:42 UTC
Thanks, the information you provided was very helpful and allowed me to quickly find the problem.

The root cause of the issue is the following:
* If, in the recovery path, VDSM is unable to access any of the volumes, it raises an exception (see [1] below).
* The exception is handled, but the lookup of the domain in libvirt is skipped (it should happen just after the buildConfDevices step).
* As a result, a Vm object is wrongly registered inside VDSM without an established connection to libvirt, which is represented by its _dom attribute.
* This in turn causes the failure of the vmCont operation (see [2] below) and the misalignment of state between VDSM and libvirt; a minimal sketch of this failure mode follows the log excerpts below.

Log excerpts demonstrating the above:

[1]
Thread-13::ERROR::2014-02-12 11:32:34,718::vm::1991::vm.Vm::(_normalizeVdsmImg) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Unable to get volume size for ede4f955-549c-4f97-b103-127ee2b3672d
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 1987, in _normalizeVdsmImg
    drv['truesize'] = res['truesize']
KeyError: 'truesize'
Thread-13::INFO::2014-02-12 11:32:34,719::vm::2244::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 2233, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/vm.py", line 3093, in _run
    devices = self.buildConfDevices()
  File "/usr/share/vdsm/vm.py", line 2061, in buildConfDevices
    self._normalizeVdsmImg(drv)
  File "/usr/share/vdsm/vm.py", line 1993, in _normalizeVdsmImg
    drv['volumeID'])
RuntimeError: Volume ede4f955-549c-4f97-b103-127ee2b3672d is corrupted or missing
Thread-13::DEBUG::2014-02-12 11:32:34,720::vm::2247::vm.Vm::(_startUnderlyingVm) vmId=`d200dff9-5656-402f-8cc4-9e1a5c68e938`::_ongoingCreations released

[2]
Thread-166::DEBUG::2014-02-12 11:37:56,327::BindingXMLRPC::970::vds::(wrapper) client [10.35.161.54]::call vmCont with ('d200dff9-5656-402f-8cc4-9e1a5c68e938',) {} flowID [28e177e]
Thread-166::ERROR::2014-02-12 11:37:56,327::BindingXMLRPC::989::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 973, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 209, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 154, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 2587, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/vm.py", line 3726, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'
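
To make the failure mode in [2] concrete, here is a minimal, self-contained Python sketch (illustrative only, not vdsm code; SketchVm is a made-up class that just mirrors the _dom handling described above):

class SketchVm(object):
    def __init__(self):
        # The libvirt domain lookup was skipped during recovery,
        # so the handle was never assigned.
        self._dom = None

    def cont(self):
        # Mirrors vm._underlyingCont(): the first use of self._dom fails.
        return self._dom.XMLDesc(0)

vm = SketchVm()
try:
    vm.cont()
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'XMLDesc'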

Comment 5 Francesco Romani 2014-02-12 11:13:17 UTC
It is worth remarking that VDSM was restarted while the storage was blocked/unavailable (per the times you provided), which then triggered the recovery and everything that follows:

MainThread::DEBUG::2014-02-12 11:30:45,946::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::INFO::2014-02-12 11:30:46,898::vmChannels::183::vds::(stop) VM channels listener was stopped.
MainThread::INFO::2014-02-12 11:30:46,898::momIF::84::MOM::(stop) Shutting down MOM
MainThread::INFO::2014-02-12 11:30:46,899::logUtils::44::dispatcher::(wrapper) Run and protect: prepareForShutdown(options=None)
MainThread::DEBUG::2014-02-12 11:30:46,901::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/01a745ee-26ee-4e73-a99a-05e791e9a5eb/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,902::sp::380::Storage.StoragePool::(cleanupMasterMount) master `/rhev/data-center/mnt/blockSD/74ee9bdc-d6cd-49af-bbd4-4e8e2a009a25/master` is not mounted, skipping
MainThread::DEBUG::2014-02-12 11:30:46,947::vdsm::55::vds::(sigtermHandler) Received signal 15
MainThread::DEBUG::2014-02-12 11:30:46,947::clientIF::207::vds::(prepareForShutdown) cannot run prepareForShutdown concurrently
MainThread::INFO::2014-02-12 11:30:47,294::domainMonitor::125::Storage.DomainMonitor::(close) Stopping domain monitors
MainThread::INFO::2014-02-12 11:30:47,295::domainMonitor::113::Storage.DomainMonitor::(stopMonitoring) Stop monitoring 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
Thread-36::DEBUG::2014-02-12 11:30:47,295::domainMonitor::175::Storage.DomainMonitorThread::(_monitorLoop) Stopping domain monitor for 13f48873-6f42-48fe-a4f5-1fa3c19a6f39
VM Channels Listener::INFO::2014-02-12 11:30:47,595::vmChannels::178::vds::(run) VM channels listener thread has ended.
Thread-37::DEBUG::2014-02-12 11:30:53,137::domainMonitor::192::Storage.DomainMonitorThread::(_monitorDomain) Refreshing domain 0639c74b-0a19-47c4-919a-105902edefbc
MainThread::INFO::2014-02-12 11:31:27,034::vdsm::119::vds::(run) (PID: 10360) I am the actual vdsm 4.14.1-2.el6 aqua-vds5.qa.lab.tlv.redhat.com (2.6.32-431.1.2.el6.x86_64)


Lowering the priority.

Comment 6 Francesco Romani 2014-02-13 08:41:39 UTC
Draft patch posted on gerrit.
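
For context, the direction of the patch (per the gerrit summary "vm: recover must finish with libvirt connection") is that an error while rebuilding the device configuration during recovery must not prevent re-attaching to the existing libvirt domain. The following is only a rough, self-contained sketch of that idea; RecoveringVm, FakeConnection and FakeDomain are made-up stand-ins, not the actual gerrit change:

class FakeDomain(object):
    def XMLDesc(self, flags):
        return "<domain/>"

class FakeConnection(object):
    def lookupByUUIDString(self, uuid):
        # Stands in for libvirt's virConnect.lookupByUUIDString()
        return FakeDomain()

class RecoveringVm(object):
    def __init__(self, vm_id, connection):
        self.id = vm_id
        self.recovering = True
        self._connection = connection
        self._dom = None

    def _run(self):
        # Device normalization fails, as in log excerpt [1].
        raise RuntimeError("Volume is corrupted or missing")

    def _startUnderlyingVm(self):
        try:
            self._run()
        except Exception:
            if not self.recovering:
                raise
            # Skipping errors on recovery, but still finish with the
            # libvirt connection so _dom is never left as None.
            self._dom = self._connection.lookupByUUIDString(self.id)

vm = RecoveringVm("d200dff9-5656-402f-8cc4-9e1a5c68e938", FakeConnection())
vm._startUnderlyingVm()
print(vm._dom.XMLDesc(0))  # works: the handle is set even though _run() failed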

Comment 7 Sandro Bonazzola 2014-03-03 15:24:00 UTC
This is an automated message.

This BZ should be fixed in the oVirt 3.4.0 RC repository; assignee, please update the status of the bug to ON_QA once you've verified that the change is included in the build.

Comment 8 Francesco Romani 2014-03-04 08:12:23 UTC
Verified by downloading the 3.4.0 RC packages from the oVirt repo.
This fix is not included in VDSM 4.14.3; it was merged *after* the tag was created.

Comment 9 Michal Skrivanek 2014-03-26 16:29:33 UTC
It's in 3.4 RC2.

Comment 11 Sandro Bonazzola 2014-03-31 12:24:37 UTC
This is an automated message: moving to CLOSED CURRENTRELEASE since oVirt 3.4.0 has been released.

