Bug 784002

Summary: Guest cpu lock is not unlocked in case guest was paused due to storage problems
Product: [Retired] oVirt
Reporter: Jakub Libosvar <jlibosva>
Component: vdsm
Assignee: Dan Kenigsberg <danken>
Status: CLOSED WORKSFORME
QA Contact: yeylon <yeylon>
Severity: urgent
Priority: unspecified
Version: unspecified
CC: abaron, acathrow, bazulay, iheim, srevivo, ykaul
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2012-01-24 13:49:12 UTC
Attachments: Libvirt, machine, vdsm, engine logs

Description Jakub Libosvar 2012-01-23 14:18:39 UTC
Created attachment 556975 [details]
Libvirt, machine, vdsm, engine logs

Description of problem:
Cannot resume a paused VM after storage problems. My setup was as follows:
two hosts and two iSCSI domains, each on a different server. Two VMs were running on the SPM host, and on both hosts the storage connection to the master SD, which held the VMs' volumes, was blocked. After a new master SD was elected and the VMs were paused, the former SPM went to Non Operational state and a new SPM was elected. I unblocked the storage connections on both hosts and then activated the Non Operational host. Then I activated the Inactive storage domain and attempted to run both machines. Only one machine started running; the second one fails to acquire the guest CPU lock. A qemu process is running for both VMs on the host.


Version-Release number of selected component (if applicable):
vdsm-4.9.3.1-0.fc16.x86_64
libvirt-client-0.9.6-4.fc16.x86_64
qemu-kvm-0.15.1-3.fc16.x86_64
libvirt-0.9.6-4.fc16.x86_64
qemu-kvm-tools-0.15.1-3.fc16.x86_64
libvirt-python-0.9.6-4.fc16.x86_64
ovirt-engine-3.0.0_0001-1.2.fc16.x86_64

How reproducible:
5 out of 5 times, but only on one VM

Steps to Reproduce:
1. See Description
  
Actual results:
VM cannot resume

Expected results:
VM is resumed

Additional info:

I have also noticed this line in vdsm log after VM was paused: 
Thread-16::WARNING::2012-01-23 14:03:22,941::libvirtvm::1180::vm.Vm::(_readPauseCode) vmId=`b28ff353-258f-4cb9-9a05-8822a9c3b218`::_readPauseCode unsupported by libvirt vm

Thread-310::DEBUG::2012-01-23 14:16:11,326::clientIF::76::vds::(wrapper) [10.34.63.26]::call cont with ('b28ff353-258f-4cb9-9a05-8822a9c3b218',) {}
Thread-310::ERROR::2012-01-23 14:17:17,354::clientIF::90::vds::(wrapper) Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 80, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/clientIF.py", line 499, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 773, in cont
    self._acquireCpuLockWithTimeout()
  File "/usr/share/vdsm/vm.py", line 769, in _acquireCpuLockWithTimeout
    timeout)
RuntimeError: waiting more that 66s for _guestCpuLock
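For context, the traceback shows `cont()` acquiring a guest-CPU lock with a timeout and failing because the lock is still held from the pause. A minimal sketch of that pattern follows; the class and method names mirror the traceback, but the body is hypothetical and not vdsm's actual implementation (which predates Python 3):

```python
import threading

GUEST_CPU_LOCK_TIMEOUT = 66  # seconds; matches the "66s" in the traceback


class Vm:
    """Hypothetical sketch of the locking pattern suggested by the
    traceback; not vdsm's real code."""

    def __init__(self):
        self._guestCpuLock = threading.Lock()

    def _acquireCpuLockWithTimeout(self, timeout=GUEST_CPU_LOCK_TIMEOUT):
        # If an earlier code path paused the guest and never released
        # the lock (the failure mode reported here), this times out.
        if not self._guestCpuLock.acquire(timeout=timeout):
            raise RuntimeError(
                "waiting more than %ss for _guestCpuLock" % timeout)

    def cont(self):
        self._acquireCpuLockWithTimeout()
        try:
            pass  # resume the guest via libvirt here
        finally:
            # Releasing in a finally block is what prevents exactly
            # this kind of stuck-lock bug.
            self._guestCpuLock.release()
```

With this structure, a `cont()` on an idle VM succeeds and releases the lock, while a stale holder makes the acquire raise after the timeout instead of blocking forever.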

Comment 1 Jakub Libosvar 2012-01-23 14:38:41 UTC
I failed to reproduce this using only one host.

Comment 2 Dan Kenigsberg 2012-01-23 15:40:41 UTC
Could it be that the lock is not released because qemu is not responding?
What is the domain state in virsh? Does it respond to suspend/resume?
Does the problem persist after vdsm is restarted?
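The checks suggested above map onto standard virsh commands; using the vmId from the log as the domain identifier (the service restart line assumes a systemd host, as on Fedora 16):

```shell
# Inspect the domain's state and the reason it entered that state
virsh domstate --reason b28ff353-258f-4cb9-9a05-8822a9c3b218

# Check whether qemu still responds to suspend/resume through libvirt
virsh suspend b28ff353-258f-4cb9-9a05-8822a9c3b218
virsh resume b28ff353-258f-4cb9-9a05-8822a9c3b218

# Restart vdsm to see whether the stuck lock survives a restart
systemctl restart vdsmd
```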

Comment 3 Jakub Libosvar 2012-01-24 13:49:12 UTC
I cannot reproduce this anymore (tried 6 times). Closing as WORKSFORME for now.