Bug 732031
Summary: | VM moved to Paused while hibernating multiple VMs
---|---
Product: | Red Hat Enterprise Linux 6
Component: | vdsm
Version: | 6.1
Hardware: | x86_64
OS: | Linux
Status: | CLOSED WORKSFORME
Severity: | medium
Priority: | medium
Reporter: | Jaroslav Henner <jhenner>
Assignee: | Eduardo Warszawski <ewarszaw>
QA Contact: | yeylon <yeylon>
CC: | abaron, bazulay, danken, iheim, srevivo, ykaul
Target Milestone: | rc
Target Release: | ---
Doc Type: | Bug Fix
Bug Depends On: | 733920
Last Closed: | 2011-09-12 12:53:04 UTC
Attachments: | 519049 (events log), 519048 (logs)
Created attachment 519049 [details]
events log
I failed to mention that the following appears in the other host's vdsm.log:

Thread-99::DEBUG::2011-08-19 14:57:32,223::clientIF::225::Storage.Dispatcher.Protect::(wrapper) [10.34.60.83]
Thread-99::INFO::2011-08-19 14:57:32,223::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)
Thread-99::DEBUG::2011-08-19 14:57:32,223::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: moving from state init -> state preparing
Thread-99::ERROR::2011-08-19 14:57:32,224::task::865::TaskManager.Task::(_setError) Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 873, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/spm.py", line 708, in public_getSpmStatus
    pool = hsm.HSM.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 125, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('8791b66f-b117-4cb5-a5fd-3fdc81d6eb39',)
Thread-99::DEBUG::2011-08-19 14:57:32,225::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: Task._run: 440e7086-9e3b-4a37-92b7-50165eda691e ('8791b66f-b117-4cb5-a5fd-3fdc81d6eb39',) {} failed - stopping task
Thread-99::DEBUG::2011-08-19 14:57:32,225::task::492::TaskManager.Task::(_debug) Task 440e7086-9e3b-4a37-92b7-50165eda691e: stopping in state preparing (force False)

I guess this is nothing more than log noise, but Edu, please take a closer look.

getSpmStatus on the recently created pool fails because the host is not connected to that pool yet. createPool does not imply that the host is connected to the pool. It can be seen that the next getSpmStatus, issued after connectPool, succeeds:

Thread-82::INFO::2011-08-19 14:57:08,528::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: createStoragePool, args: ( poolType=3 spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 poolName=Migrations_DC masterDom=678a45a4-b5f8-418b-b541-19a6df9c78eb domList=['678a45a4-b5f8-418b-b541-19a6df9c78eb'] masterVersion=1 lockPolicy=ON lockRenewalIntervalSec=5 leaseTimeSec=60 ioOpTimeoutSec=10 leaseRetries=3)
14:57:31,342::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: createStoragePool, Return response: {'status': {'message': 'OK', 'code': 0}}
14:57:32,223::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)  <<< FAILED >>>
Thread-100::INFO::2011-08-19 14:57:32,247::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: connectStoragePool, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 hostID=1 scsiKey=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39 msdUUID=678a45a4-b5f8-418b-b541-19a6df9c78eb masterVersion=1)
Thread-100::INFO::2011-08-19 14:57:35,623::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: connectStoragePool, Return response: {'status': {'message': 'OK', 'code': 0}}
Thread-110::INFO::2011-08-19 14:57:35,629::dispatcher::94::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, args: ( spUUID=8791b66f-b117-4cb5-a5fd-3fdc81d6eb39)
14:57:35,631::dispatcher::100::Storage.Dispatcher.Protect::(run) Run and protect: getSpmStatus, Return response: {'status': {'message': 'OK', 'code': 0}, 'spm_st': {'spmId': -1, 'spmStatus': 'Free', 'spmLver': -1}}

To clarify the libvirt error, the libvirt log is required.
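[Editor's note] A minimal sketch of why the lookup above fails, assuming the pool registry consulted by getSpmStatus is populated only by connectStoragePool and not by createStoragePool. The class and exception names are modeled on the traceback (hsm.py, StoragePoolUnknown); this is illustrative, not the actual vdsm source.

```python
# Hypothetical, simplified model of the lookup in /usr/share/vdsm/storage/hsm.py.
# Not the real implementation; names follow the traceback above.

class StoragePoolUnknown(Exception):
    """Raised when a pool UUID is not known to this host."""


class HSM(object):
    # Assumption: entries are added by connectStoragePool(); createStoragePool()
    # alone does not register the pool on this host.
    pools = {}

    @classmethod
    def getPool(cls, spUUID):
        if spUUID not in cls.pools:
            # Any getSpmStatus issued after createStoragePool but before
            # connectStoragePool lands here, producing the
            # "Unknown pool id, pool not connected" error in the log.
            raise StoragePoolUnknown(spUUID)
        return cls.pools[spUUID]
```

Under that assumption the 14:57:32 failure is expected and harmless, since getSpmStatus is retried and succeeds once connectStoragePool has returned OK at 14:57:35.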
First of all, the title is totally incorrect and misleading.

Second, you suspended your VMs and they paused; what exactly is the problem? There is indeed some noise in the log ("libvirtError: Domain not found: no domain with matching uuid") that should be handled, but I don't see that it is important enough to surface as an exception. The suspend operation succeeded, as can be seen in:

Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::59::vds::(wrapper) return getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'timeOffset': '0', 'vmId': '29ed44b1-02cd-4c14-bbbc-e32926855202', 'exitMessage': 'SaveState succeeded', 'exitCode': 0}]}

Regarding the libvirt error, it looks like vdsm is calling virDomainGetBlockInfo after the VM status is set to Down; is this supported at all? (Dan?)

(In reply to comment #7)
> First of all, the title is totally incorrect and misleading.

Sorry about that. I could write something like "VM got paused when suspending it", but that would be too broad, and I don't know how to tell what actually went wrong.

> Second, you suspended your VMs and they paused; what exactly is the problem?

It should have been suspended, not paused. In RHEV-M terminology, a VM gets paused, for example, when its storage is inaccessible, which was not the case here.

(In reply to comment #7)
> Regarding the libvirt error, it looks like vdsm is calling virDomainGetBlockInfo
> after the VM status is set to Down; is this supported at all? (Dan?)

We have a thread polling libvirt for info. When a VM dies (due to a successful migration, for example) there may be a race between the event notifying us about it and the polling failure. That's what I meant by log noise.

Ok, so the bug here is that a VM moved to the Paused state before successfully being hibernated (=saved) to disk. Would you please reproduce this and attach the matching libvirt.log, as requested in comment 6? Note that I do not consider this temporary state change a high-severity bug; it may well not get into 3.0 at this point.

I'm sorry, I couldn't reproduce it. I made about 10 suspends without hitting any problem like this.

Are you using the latest libvirt version now? In any case, if it works for you now, there's a bug resolution for that.

Yes, I worked with the latest libvirt.
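[Editor's note] A rough way to observe the transient state change discussed above: save a domain while a second thread polls its state, mimicking a stats collector. The domain name "Migrations_pool-0" and the save path are placeholders, and this is not how RHEV-M/vdsm actually drives the suspend; it only illustrates what a poller can see while a save is in flight.

```python
import threading
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('Migrations_pool-0')   # assumed running test domain

def poll_state():
    # A monitoring loop similar in spirit to vdsm's stats collector.
    while True:
        try:
            state, reason = dom.state()
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
                print('domain gone (transient domain, save finished)')
                return
            raise
        if state == libvirt.VIR_DOMAIN_PAUSED:
            # The guest CPUs are stopped while memory is written out, so a
            # poller can legitimately observe PAUSED here before the
            # "Down / SaveState succeeded" status is reported.
            print('observed transient PAUSED during save')
        if state == libvirt.VIR_DOMAIN_SHUTOFF:
            print('save finished, domain is shut off')
            return
        time.sleep(0.1)

t = threading.Thread(target=poll_state)
t.start()
dom.save('/var/tmp/Migrations_pool-0.sav')   # virDomainSave; blocks until done
t.join()
```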
Created attachment 519048 [details]
logs

Description of problem:
When suspending four VMs, I ran into a situation where the VM Migrations_pool-0 got paused. In the vdsm.log:

Thread-481::DEBUG::2011-08-19 15:10:30,224::vm::739::vm.Vm::(setDownStatus) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Changed state to Down: SaveState succeeded
Thread-481::DEBUG::2011-08-19 15:10:30,224::utils::314::vm.Vm::(stop) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stop statistics collection
Thread-221::ERROR::2011-08-19 15:10:30,227::utils::369::vm.Vm::(collect) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x26ac0a0>
Traceback (most recent call last):
  File "/usr/share/vdsm/utils.py", line 366, in collect
    statsFunction()
  File "/usr/share/vdsm/utils.py", line 250, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 65, in _highWrite
    self._vm._dom.blockInfo(vmDrive.path, 0)
  File "/usr/share/vdsm/libvirtvm.py", line 327, in f
    ret = attr(*args, **kwargs)
  File "/usr/share/vdsm/libvirtconnection.py", line 63, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1284, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: Domain not found: no domain with matching uuid '29ed44b1-02cd-4c14-bbbc-e32926855202'
Thread-221::DEBUG::2011-08-19 15:10:30,228::utils::345::vm.Vm::(run) vmId=`29ed44b1-02cd-4c14-bbbc-e32926855202`::Stats thread finished
Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::54::vds::(wrapper) [10.34.60.83]::call getVmStats with ('29ed44b1-02cd-4c14-bbbc-e32926855202',) {}
Thread-524::DEBUG::2011-08-19 15:10:30,364::clientIF::59::vds::(wrapper) return getVmStats with {'status': {'message': 'Done', 'code': 0}, 'statsList': [{'status': 'Down', 'timeOffset': '0', 'vmId': '29ed44b1-02cd-4c14-bbbc-e32926855202', 'exitMessage': 'SaveState succeeded', 'exitCode': 0}]}

Version-Release number of selected component (if applicable):
vdsm-4.9-91.1.el6.x86_64
libvirt-0.9.4-0rc1.2.el6.x86_64
qemu-kvm-0.12.1.2-2.182.el6.x86_64
rhevm ic137

How reproducible:
Unknown

Steps to Reproduce:
Unknown

Actual results:
VM paused.

Expected results:
VM suspends and then resumes.

Additional info:
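[Editor's note] If the traceback above were to be silenced, one possible shape of a guard around the failing call is sketched below. high_write_watermark is a hypothetical helper written for this report, not vdsm's actual _highWrite; it only shows how a stats function can treat VIR_ERR_NO_DOMAIN as "VM already gone" instead of raising.

```python
import libvirt

def high_write_watermark(dom, drive_path):
    """Sketch of a _highWrite-style stats call that tolerates the VM
    disappearing between the 'Down' status update and the stop of the
    stats thread. `dom` is a libvirt.virDomain, `drive_path` a disk path.
    Illustrative only; not the actual vdsm change."""
    try:
        # virDomainGetBlockInfo returns (capacity, allocation, physical).
        capacity, allocation, physical = dom.blockInfo(drive_path, 0)
        return allocation
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # The domain was already undefined (save or migration finished);
            # report nothing instead of logging a full traceback, since the
            # Down event will stop this stats thread shortly anyway.
            return None
        raise
```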