Bug 732031

Summary: VM moved to Paused while hibernating multiple VMs
Product: Red Hat Enterprise Linux 6 Reporter: Jaroslav Henner <jhenner>
Component: vdsmAssignee: Eduardo Warszawski <ewarszaw>
Status: CLOSED WORKSFORME QA Contact: yeylon <yeylon>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.1CC: abaron, bazulay, danken, iheim, srevivo, ykaul
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-09-12 12:53:04 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 733920    
Bug Blocks:    
Attachments:
Description Flags
logs
none
events log none

Description Jaroslav Henner 2011-08-19 14:28:05 UTC
Created attachment 519048 [details]
logs

Description of problem:
When suspending four vms, I went into situation that the VM Migrations_pool-0 got paused.

In the vdsm.log:
Thread-481::DEBUG::2011-08-19 15:10:30,224::vm::739::vm.Vm::(setDownStatus) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Changed state to Down: SaveState succeeded
Thread-481::DEBUG::2011-08-19 15:10:30,224::utils::314::vm.Vm::(stop) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stop statistics collection
Thread-221::ERROR::2011-08-19 15:10:30,227::utils::369::vm.Vm::(collect) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x26ac0a0>
Traceback (most recent call last):
  File "/usr/share/vdsm/utils.py", line 366, in collect
    statsFunction()
  File "/usr/share/vdsm/utils.py", line 250, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 65, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 327, in f
    ret = attr(*args, **kwargs)
  File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1284, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '29ed44b1-02cd-4c14-bbbc-e32926855202'
Thread-221::DEBUG::2011-08-19 15:10:30,228::utils::345::vm.Vm::(run) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stats thread finished
Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::54::vds::(wrapper) [10.34.60.83]::call getVmStats with ('29ed44b1-02cd-4c14-bbbc-e32926855202',) {}
Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::59::vds::(wrapper) return getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'timeOffset': '0', 'vmId': '29ed44b1-02cd-4c14-bbbc-e32926855202', 'exitMessage': 'SaveState succeeded', 'exitCode': 0}]}


Version-Release number of selected component (if applicable):
vdsm-4.9-91.1.el6.x86_64
libvirt-0.9.4-0rc1.2.el6.x86_64
qemu-kvm-0.12.1.2-2.182.el6.x86_64
rhevm ic137

How reproducible:
Unknown

Steps to Reproduce:
Unknown
  
Actual results:
VM paused

Expected results:
VM suspends and then resumes.

Additional info:

Comment 1 Jaroslav Henner 2011-08-19 14:28:49 UTC
Created attachment 519049 [details]
events log

Comment 3 Jaroslav Henner 2011-08-19 14:34:53 UTC
I failed to mention that there is the follwing in the other host's vdsm.log.

Thread-99::DEBUG::2011-08-19 14:57:32,223::clientIF::225::Storage.Dispatcher.Protect::(wrapper) [10.34.60.83]
Thread-99::INFO::2011-08-19 14:57:32,223::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)
Thread-99::DEBUG::2011-08-19 14:57:32,223::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: moving from state init -> state preparing
Thread-99::ERROR::2011-08-19 14:57:32,224::task::865::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 708, in public_getSpmStatus
    pool = hsm.HSM.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 125, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('8791b66f-b117-4cb5-a5fd-3fdc81d6eb39',)
Thread-99::DEBUG::2011-08-19 14:57:32,225::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: Task._run: 440e7086-9e3b-4a37-92b7-50165eda691e ('8791b66f-b117-4cb5-a5fd-3fdc81d6eb39',) {} failed - stopping task
Thread-99::DEBUG::2011-08-19 14:57:32,225::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: stopping in state preparing (force False)

Comment 4 Dan Kenigsberg 2011-08-21 07:59:42 UTC
I guess this is nothing more than log noise, but Edu, please take a better look.

Comment 5 Eduardo Warszawski 2011-08-23 12:13:13 UTC
getSpmStatus on the recently created pool fails because the host is not connected to this pool yet.

createPool not implies that the host is connected to this pool.

Can be observed that the next getSpmStatus, after the connectPool succeed.



Thread-82::INFO::2011-08-19 14:57:08,528::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: createStoragePool, args: ( poolType=3 spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 poolName=Migrations_DC masterDom=678a45a4-b5f8-418b-b541-19a6df9c78eb domList=['678a45a4-b5f8-418b-b541-19a6df9c78eb'] masterVersion=1 lockPolicy=ON lockRenewalIntervalSec=5 leaseTimeSec=60 ioOpTimeoutSec=10 leaseRetries=3)

14:57:31,342::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: createStoragePool, Return response: {'status': {'message': 'OK', 'code': 0}}
14:57:32,223::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)
Thread-100::INFO::2011-08-19 

<<< FAILED >>>


14:57:32,247::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: connectStoragePool, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 hostID=1 scsiKey=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 msdUUID=678a45a4-b5f8-418b-b541-19a6df9c78eb masterVersion=1)
Thread-100::INFO::2011-08-19 

14:57:35,623::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: connectStoragePool, Return response: {'status': {'message': 'OK', 'code': 0}}


Thread-110::INFO::2011-08-19 14:57:35,629::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)

14:57:35,631::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}




Comment 6 Eduardo Warszawski 2011-08-23 12:46:39 UTC
In order to clarify the libvirt error the libvirt log is required.

Comment 7 Ayal Baron 2011-08-31 19:01:57 UTC
First of all title is totally incorrect and misleading.
Second, you suspended your VMs and they paused, what exactly is the problem?
There is some noise in the log indeed ("libvirtError: Domain not found: no domain with matching uuid") and should be handled, but I don't see it's important enough to pass as an Excepion.

Suspend operation succeeded as can be noted in:
Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::59::vds::(wrapper) return
getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList':
[{'status': 'Down', 'timeOffset': '0', 'vmId':
'29ed44b1-02cd-4c14-bbbc-e32926855202', 'exitMessage': 'SaveState succeeded',
'exitCode': 0}]}


Wrt the libvirt error, it looks like vdsm is calling virDomainGetBlockInfo after the VM status is set to down, is this supported at all? (Dan?)

Comment 8 Jaroslav Henner 2011-08-31 19:11:34 UTC
(In reply to comment #7)
> First of all title is totally incorrect and misleading.
Sorry about that. I could write something VM got paused when suspending it, but this would be too broad and I don't know how can I know what happen wrong.

> Second, you suspended your VMs and they paused, what exactly is the problem?
It should have got suspended, not paused. In rhevm terminology, the VM gets paused for example when the storage is inaccessible which was not the case.

Comment 9 Dan Kenigsberg 2011-09-01 23:09:32 UTC
(In reply to comment #7)
> Wrt the libvirt error, it looks like vdsm is calling virDomainGetBlockInfo
> after the VM status is set to down, is this supported at all? (Dan?)

we have a thread polling libvirt for info. When a VM dies (due to successful migration, for example) we may have a race between the event notifying us about it and the polling failure. That's what I meant by log noise.

Comment 10 Dan Kenigsberg 2011-09-04 08:51:29 UTC
Ok, so the bug here is that a VM moved to Paused state before successfully being hibernated (=saved) to disk.

Would you please reproduce this, and add matching libvirt.log as requested in comment 6.

Note that I do not consider this temporary state change as a high-severity bug; it may well not get into 3.0 at this point.

Comment 11 Jaroslav Henner 2011-09-05 14:04:30 UTC
I'm sorry, I couldn't reproduce. I made like 10 suspends without any problem like this.

Comment 12 Dan Kenigsberg 2011-09-05 15:12:47 UTC
Are you using latest libvirt version now?

In any case, if it works for you now, there's bug resolution for that.

Comment 13 Jaroslav Henner 2011-09-12 12:53:04 UTC
Yes, I worked with the latest libvirt.