Bug 797526

Summary: 3.1 - VM recovery fails after vdsm is restarted because we try to restore the VM before the pool is connected
Product: Red Hat Enterprise Linux 6
Reporter: Haim <hateya>
Component: vdsm
Assignee: Federico Simoncelli <fsimonce>
Status: CLOSED ERRATA
QA Contact: Dafna Ron <dron>
Severity: urgent
Priority: unspecified
Version: 6.3
CC: abaron, acathrow, bazulay, chayang, cpelland, danken, dron, ewarszaw, iheim, ilvovsky, jbiddle, mgoldboi, yeylon, ykaul
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version: vdsm-4.9.6-12
Doc Type: Bug Fix
Doc Text:
Previously, when the VDSM service was restarted, recovery of running virtual machines could fail because VDSM attempted to recover them before it had re-connected to the storage pool. This caused the virtual machines to appear as running in libvirt but as paused in VDSM. Now, VDSM re-connects to the storage pool before attempting to recover the virtual machines.
Last Closed: 2012-12-04 18:53:59 UTC
Attachments:
- vdsm log (flags: none)
- unexpected exception (Error code:16) on vdsm-13 (flags: none)

Description Haim 2012-02-26 14:13:17 UTC
Description of problem:

The VM recovery flow fails after the vdsm service is restarted, due to a missing connection to the storage pool.
Searching the logs shows clearly that recovery is called before the connection to the pool is established.

Thread-35::DEBUG::2012-02-26 15:25:02,384::vm::565::vm.Vm::(_startUnderlyingVm) vmId=`a194d529-2a5a-4ff0-aced-942ec0a41d8f`::_ongoingCreations acquired
Thread-35::INFO::2012-02-26 15:25:02,387::libvirtvm::1229::vm.Vm::(_run) vmId=`a194d529-2a5a-4ff0-aced-942ec0a41d8f`::VM wrapper has started
Thread-35::WARNING::2012-02-26 15:25:02,388::vm::396::vm.Vm::(getConfDevices) vmId=`a194d529-2a5a-4ff0-aced-942ec0a41d8f`::Unknown type found, device: '{'device': u'unix', 'alias': u'channel0', 'type': u'channel', 'address': {u'bus': u'
0', u'controller': u'0', u'type': u'virtio-serial', u'port': u'1'}}' found
Thread-35::WARNING::2012-02-26 15:25:02,389::vm::396::vm.Vm::(getConfDevices) vmId=`a194d529-2a5a-4ff0-aced-942ec0a41d8f`::Unknown type found, device: '{'device': u'spicevmc', 'alias': u'channel1', 'type': u'channel', 'address': {u'bus'
: u'0', u'controller': u'0', u'type': u'virtio-serial', u'port': u'2'}}' found
Thread-35::INFO::2012-02-26 15:25:02,391::vm::574::vm.Vm::(_startUnderlyingVm) vmId=`a194d529-2a5a-4ff0-aced-942ec0a41d8f`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 567, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1272, in _run
    self._devices[devType].append(devClass(self.conf, self.log, **dev))
  File "/usr/share/vdsm/libvirtvm.py", line 989, in __init__
    self.blockDev = utils.isBlockDevice(self.path)
  File "/usr/share/vdsm/utils.py", line 52, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/rhev/data-center/711a080f-4702-450c-9f5f-bf54f1e99383/45290416-1dd4-494d-8ac9-29471752bd21/images/4d7d52fa-ff77-4089-a19c-9fb6a2f47e39/78c1e6de-ad7e-4f94-8ab9-3f748f75d8a0'

Repro:

- run VMs
- restart vdsm
- the VM recovery flow starts
- the VM recovery flow fails

Notes:

- VMs are reported as running in libvirt
- VMs are reported as paused in vdsm

git commit: b39d55059fc6cbc4fb3de522e45ea2381f68a3a7

Comment 1 Haim 2012-02-26 14:16:00 UTC
Created attachment 565897 [details]
vdsm log

Comment 6 Royce Lv 2012-05-09 10:31:35 UTC
1. HSM.__init__:
   def storageRefresh():
     try:
       self.__cleanStorageRepository()  # this cleans (removes) the spUUID directories
     ....
     for spUUID in dirList:
       self.log.info("lvroyce find storage pool file")
       poolPath = os.path.join(self.storage_repository, spUUID)
       try:
         if os.path.exists(poolPath):  # the path no longer exists, so we never connect
           self.log.info("lvroyce find storage pool path")
           self._connectStoragePool(spUUID, None, None, None, None)

2. threading.Thread(target=storageRefresh).start() runs in its own thread, and
   VM recovery runs in another thread. Even if storageRefresh did connect the
   pool, recovery will still fail whenever it runs before the connection is
   established, as can be seen in your log:

Thread-49::INFO::2012-02-26 14:39:13,470::vm::574::vm.Vm::(_startUnderlyingVm) vmId=`34c67b37-9ace-4a39-9040-71cdc0f1e397`::Skipping errors on recovery
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 567, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/libvirtvm.py", line 1272, in _run
    self._devices[devType].append(devClass(self.conf, self.log, **dev))
  File "/usr/share/vdsm/libvirtvm.py", line 989, in __init__
    self.blockDev = utils.isBlockDevice(self.path)
  File "/usr/share/vdsm/utils.py", line 52, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)

Thread-11::DEBUG::2012-02-26 14:39:13,534::hsm::353::Storage.HSM::(__cleanStorageRepository) Finished cleaning storage repository at '/rhev/data-center'
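
As a minimal standalone illustration of point 1 above (this is not the vdsm code itself; the temporary directory and print statements are placeholders), the existence check can never succeed once the repository has been cleaned:

import os
import shutil
import tempfile

# Hypothetical stand-in for /rhev/data-center with a single pool directory in it.
storage_repository = tempfile.mkdtemp()
spUUID = "711a080f-4702-450c-9f5f-bf54f1e99383"
os.makedirs(os.path.join(storage_repository, spUUID))

# Equivalent of __cleanStorageRepository(): the per-pool directory is wiped.
shutil.rmtree(os.path.join(storage_repository, spUUID))

# Equivalent of the loop quoted above: the guard is always False after the cleanup,
# so _connectStoragePool() is never reached.
poolPath = os.path.join(storage_repository, spUUID)
if os.path.exists(poolPath):
    print("would call _connectStoragePool(%s, ...)" % spUUID)
else:
    print("pool directory is gone; the pool is never reconnected")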

Comment 7 Federico Simoncelli 2012-05-10 18:25:56 UTC
BZ#797526 Wait pool connection before recovery

In this patch:
* Serialize connectStoragePool and _recoverExistingVms
* Don't check whether the pool directory is present before reconnecting to the pool
  (_connectStoragePool); the pool directory was in fact removed a few lines
  earlier by __cleanStorageRepository
* Prepare the volume paths (preparePaths) during recovery as well

Signed-off-by: Federico Simoncelli <fsimonce>
Change-Id: I418ce9641d1cdb7f7202fd8c08e7c027d1f6d325

http://gerrit.ovirt.org/4283
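
As a rough sketch of the serialization idea only (the function and event names below are hypothetical and not the ones used in the actual patch), the recovery thread can block on an event that the storage-refresh thread sets once the pool is connected:

import threading

# Hypothetical event: recovery must not touch volume paths until the pool is back.
pool_connected = threading.Event()

def storage_refresh(connect_pool):
    # Clean the repository, reconnect the pool, then signal the recovery thread.
    try:
        connect_pool()           # stands in for _connectStoragePool(spUUID, ...)
    finally:
        pool_connected.set()     # unblock recovery even on failure, so it can report errors

def recover_existing_vms(recover):
    # Wait for the pool connection instead of racing against it.
    pool_connected.wait()        # blocks until storage_refresh() has finished
    recover()                    # stands in for preparePaths() + rebuilding the Vm objects

# Both flows still run in their own threads, but now in a fixed order.
threading.Thread(target=storage_refresh, args=(lambda: None,)).start()
threading.Thread(target=recover_existing_vms, args=(lambda: None,)).start()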

Comment 8 Federico Simoncelli 2012-05-16 11:17:37 UTC
Recently I proposed a better fix for the issue:

BZ#797526 Lazy evaluation for Drive.blockDev

Signed-off-by: Federico Simoncelli <fsimonce>
Change-Id: I93bc95650544372cec8883a8b7b568ecf419c129

http://gerrit.ovirt.org/4339
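
As a sketch of what lazy evaluation of Drive.blockDev could look like (the class below only mirrors the attribute names from the traceback in comment #0; it is not the actual gerrit change), the block-device check is deferred until the attribute is first read, so the object can be built before the pool path exists:

import os
import stat

class Drive(object):

    def __init__(self, path):
        self.path = path
        self._blockDev = None          # unknown until somebody actually needs it

    @property
    def blockDev(self):
        if self._blockDev is None:
            try:
                self._blockDev = stat.S_ISBLK(os.stat(self.path).st_mode)
            except OSError:
                # Path not prepared yet (e.g. pool not connected): stay undecided
                # instead of failing the whole VM recovery at construction time.
                return False
        return self._blockDev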

Comment 9 Federico Simoncelli 2012-05-16 18:24:26 UTC
*** Bug 819460 has been marked as a duplicate of this bug. ***

Comment 11 Chao Yang 2012-06-05 06:03:48 UTC
I am changing this bug to ASSIGNED, since I hit this issue again on vdsm-4.9.6-13.el6.x86_64. If this is a mistake, please move it back.

A slice of the vdsm log (I will upload the whole vdsm log soon):
Thread-231::DEBUG::2012-06-05 13:25:14,627::BindingXMLRPC::864::vds::(wrapper) client [10.66.11.208]::call vmCont with ('fdabf547-909a-4242-b957-be6d60bf1ca7',) {}
Thread-231::ERROR::2012-06-05 13:25:14,629::BindingXMLRPC::879::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 210, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 121, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 823, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/libvirtvm.py", line 1581, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'

# vdsClient -s 0 list table
5fd21790-5d3b-4be3-accd-d05cec96db86  15816  win7-16-6-1          Up                   10.66.108.78 10.66.109.239
404eacb0-ac9d-40a8-80ff-ebaecccc97bd   6919  win2k8r2-16-6-2      Up                   10.66.110.78 10.66.107.81
39a88c9e-35a8-4fde-b1fa-d5a0e6534319   7240  rhel6u3-16-6-1       Up                   10.66.106.15 10.66.109.250
bf8c3501-94e4-40b3-93a9-b7d4a23b7ccc   2331  rhel5u8-16-6-2       Up                   10.66.108.253 10.66.110.249
68b2308d-df40-475f-94d2-c396d0857233  13498  win2k8r2-16-6-1      Up                   10.66.107.227 10.66.105.3
fdabf547-909a-4242-b957-be6d60bf1ca7  12617  win7-16-6-2          Paused                                   
cb9c3a4c-f1e5-462a-ba4a-cf1973c00485   5957  rhel6u2-16-6-1       Up                   10.66.108.255 10.66.107.254
91192d02-5ecf-43ec-b5d3-6b616e6fc513   2835  rhel5u8-16-6-1       Up                   10.66.110.252       
9856096f-d7fc-4285-b7f8-a924ab04201e   3393  rhel6u2-16-6-2       Up                   10.66.110.250 10.66.110.248
10b0f8ec-24ad-45c5-b72e-5f549d202337   5415  rhel6u3-16-6-2       Up                   10.66.109.252 10.66.106.244
# vdsClient -s 0 continue fdabf547-909a-4242-b957-be6d60bf1ca7
Unexpected exception

Comment 12 Chao Yang 2012-06-05 06:05:27 UTC
Created attachment 589388 [details]
unexpected exception (Error code:16) on vdsm-13

# rpm -q vdsm libvirt qemu-kvm-rhev
vdsm-4.9.6-13.el6.x86_64
libvirt-0.9.10-21.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.294.el6.x86_64

Comment 13 Dan Kenigsberg 2012-06-05 09:12:03 UTC
Moving to rhel-6.3.z, as this is a rhev-3.1 blocker, not a rhel-6.3.0 one.

Comment 14 Federico Simoncelli 2012-06-05 11:15:43 UTC
(In reply to comment #11)
> I am changing this bug to ASSIGNED, since I hit this issue again on
> vdsm-4.9.6-13.el6.x86_64. If any mistake, please move it back. 
> 
> A slice of vdsm log, will upload the whole vdsm log soon:
> Thread-231::DEBUG::2012-06-05
> 13:25:14,627::BindingXMLRPC::864::vds::(wrapper) client [10.66.11.208]::call
> vmCont with ('fdabf547-909a-4242-b957-be6d60bf1ca7',) {}
> Thread-231::ERROR::2012-06-05
> 13:25:14,629::BindingXMLRPC::879::vds::(wrapper) unexpected error
> Traceback (most recent call last):
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
>     res = f(*args, **kwargs)
>   File "/usr/share/vdsm/BindingXMLRPC.py", line 210, in vmCont
>     return vm.cont()
>   File "/usr/share/vdsm/API.py", line 121, in cont
>     return v.cont()
>   File "/usr/share/vdsm/vm.py", line 823, in cont
>     self._underlyingCont()
>   File "/usr/share/vdsm/libvirtvm.py", line 1581, in _underlyingCont
>     hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
> AttributeError: 'NoneType' object has no attribute 'XMLDesc'

The traceback and the logs you attached do not contain anything related to comment #0. The issue here is caused by something different: the VM object initialization was stopped by the following exception:

Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 275, in _recoverExistingVms
    vmObj.preparePaths(vmObj.getConfDevices()[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 402, in getConfDevices
    self.normalizeDrivesIndices(devices[DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 540, in normalizeDrivesIndices
    if not self._usedIndices.has_key(drv['iface']):
KeyError: 'iface'

Comment 15 Chao Yang 2012-06-05 11:26:28 UTC
(In reply to comment #14)
> The VM object initialization was stopped by such 
> (In reply to comment #11)
> > I am changing this bug to ASSIGNED, since I hit this issue again on
> > vdsm-4.9.6-13.el6.x86_64. If any mistake, please move it back. 
> > 
> > A slice of vdsm log, will upload the whole vdsm log soon:
> > Thread-231::DEBUG::2012-06-05
> > 13:25:14,627::BindingXMLRPC::864::vds::(wrapper) client [10.66.11.208]::call
> > vmCont with ('fdabf547-909a-4242-b957-be6d60bf1ca7',) {}
> > Thread-231::ERROR::2012-06-05
> > 13:25:14,629::BindingXMLRPC::879::vds::(wrapper) unexpected error
> > Traceback (most recent call last):
> >   File "/usr/share/vdsm/BindingXMLRPC.py", line 869, in wrapper
> >     res = f(*args, **kwargs)
> >   File "/usr/share/vdsm/BindingXMLRPC.py", line 210, in vmCont
> >     return vm.cont()
> >   File "/usr/share/vdsm/API.py", line 121, in cont
> >     return v.cont()
> >   File "/usr/share/vdsm/vm.py", line 823, in cont
> >     self._underlyingCont()
> >   File "/usr/share/vdsm/libvirtvm.py", line 1581, in _underlyingCont
> >     hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
> > AttributeError: 'NoneType' object has no attribute 'XMLDesc'
> 
> The traceback and the logs you attached don't contain anything related to
> comment #0. In fact the issue here is generated by something different:
> 
> Traceback (most recent call last):
>   File "/usr/share/vdsm/clientIF.py", line 275, in _recoverExistingVms
>     vmObj.preparePaths(vmObj.getConfDevices()[vm.DISK_DEVICES])
>   File "/usr/share/vdsm/vm.py", line 402, in getConfDevices
>     self.normalizeDrivesIndices(devices[DISK_DEVICES])
>   File "/usr/share/vdsm/vm.py", line 540, in normalizeDrivesIndices
>     if not self._usedIndices.has_key(drv['iface']):
> KeyError: 'iface'

So, I need to open another bz?

Comment 16 Federico Simoncelli 2012-06-06 13:15:56 UTC
(In reply to comment #15)
> (In reply to comment #14)
> > The traceback and the logs you attached don't contain anything related to
> > comment #0. In fact the issue here is generated by something different:
> > 
> > Traceback (most recent call last):
> >   File "/usr/share/vdsm/clientIF.py", line 275, in _recoverExistingVms
> >     vmObj.preparePaths(vmObj.getConfDevices()[vm.DISK_DEVICES])
> >   File "/usr/share/vdsm/vm.py", line 402, in getConfDevices
> >     self.normalizeDrivesIndices(devices[DISK_DEVICES])
> >   File "/usr/share/vdsm/vm.py", line 540, in normalizeDrivesIndices
> >     if not self._usedIndices.has_key(drv['iface']):
> > KeyError: 'iface'
> 
> So, I need to open another bz?

Yes please.

Comment 17 Chao Yang 2012-06-07 03:30:18 UTC
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > The traceback and the logs you attached don't contain anything related to
> > > comment #0. In fact the issue here is generated by something different:
> > > 
> > > Traceback (most recent call last):
> > >   File "/usr/share/vdsm/clientIF.py", line 275, in _recoverExistingVms
> > >     vmObj.preparePaths(vmObj.getConfDevices()[vm.DISK_DEVICES])
> > >   File "/usr/share/vdsm/vm.py", line 402, in getConfDevices
> > >     self.normalizeDrivesIndices(devices[DISK_DEVICES])
> > >   File "/usr/share/vdsm/vm.py", line 540, in normalizeDrivesIndices
> > >     if not self._usedIndices.has_key(drv['iface']):
> > > KeyError: 'iface'
> > 
> > So, I need to open another bz?
> 
> Yes please.

New bz entry opened at: https://bugzilla.redhat.com/show_bug.cgi?id=829566

Comment 22 Dafna Ron 2012-11-27 12:34:16 UTC
Verified on si24.4 with vdsm-4.9.6-44.0.el6_3.x86_64.

Comment 24 errata-xmlrpc 2012-12-04 18:53:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-1508.html