Bug 1011472 - [vdsm] cannot recover VM upon vdsm restart after a disk has been hot plugged to it
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: high, Severity: medium
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Yeela Kaplan
QA Contact: Aharon Canan
Whiteboard: storage
Depends On:
Blocks: 1019461 rhev3.4beta 1142926
 
Reported: 2013-09-24 07:17 EDT by Elad
Modified: 2016-02-10 15:16 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-23 08:10:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
abaron: Triaged+


Attachments
logs (1.90 MB, application/x-gzip), 2013-09-24 07:17 EDT, Elad
sanlock.log (103.32 KB, text/plain), 2013-09-24 08:23 EDT, Elad
vdsm.log (hotplug) (870.30 KB, application/x-xz), 2013-09-24 09:02 EDT, Elad

Description Elad 2013-09-24 07:17:21 EDT
Created attachment 802186 [details]
logs

Description of problem:
Cannot run a VM from the 'paused' state after connectivity with the storage has resumed.

Version-Release number of selected component (if applicable):
vdsm-4.12.0-138.gitab256be.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. Have an iSCSI data center with 2 storage domains created on 2 different storage servers.
2. Run a VM with a disk located on the non-master storage domain.
3. Block connectivity from the host to the non-master storage domain using iptables (a sketch follows these steps).
4. When the VM enters the 'paused' state, resume connectivity to the storage.
5. When the host is active again, try to activate the VM.
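
To make step 3 concrete, here is a minimal Python sketch of blocking and restoring the iSCSI traffic; the storage address and the default iSCSI port (3260) are assumptions, not values taken from this bug:

# Illustrative helper for step 3 -- the storage IP is hypothetical and the
# default iSCSI port is assumed; adjust both for a real environment.
import subprocess

STORAGE_IP = "192.0.2.10"   # hypothetical non-master storage server
ISCSI_PORT = "3260"         # default iSCSI target port

def block_storage():
    # Drop outgoing traffic from the host to the iSCSI target (step 3).
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", STORAGE_IP,
                           "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"])

def resume_storage():
    # Delete the blocking rule to restore connectivity (step 4).
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", STORAGE_IP,
                           "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"])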

Actual results:
Cannot start the VM from the paused state. vdsm fails with:

clientIFinit::ERROR::2013-09-23 18:12:26,480::clientIF::465::vds::(_recoverExistingVms) Vm afac6a2c-2210-4f5d-a827-cadb046243d1 recovery failed
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 462, in _recoverExistingVms
    vmObj.getConfDevices()[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 1873, in getConfDevices
    self.normalizeDrivesIndices(devices[DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 2058, in normalizeDrivesIndices
    if drv['iface'] not in self._usedIndices:
KeyError: 'iface'


Thread-386::ERROR::2013-09-23 18:27:07,690::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 227, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 145, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 2396, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/vm.py", line 3440, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'
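
A note for context: the second traceback follows directly from the failed recovery above. Because _recoverExistingVms raised before a libvirt domain handle was attached, self._dom is still None when cont() reaches XMLDesc(). A minimal sketch of that situation, using assumed names rather than vdsm's actual classes:

# Assumed names, illustration only: why cont() hits AttributeError when
# recovery never attached a libvirt domain handle to the VM object.
class RecoveredVm(object):
    def __init__(self, libvirt_dom=None):
        # stays None when recovery fails, as in the traceback above
        self._dom = libvirt_dom

    def cont(self):
        if self._dom is None:
            # a guard like this would fail early with a clearer message
            raise RuntimeError("no underlying libvirt domain; VM was not recovered")
        return self._dom.XMLDesc(0)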


Not sure whether it's a storage or a network issue.


Additional info: logs
Comment 1 Ayal Baron 2013-09-24 08:11:34 EDT
You failed to mention that vdsm restarted:

MainThread::INFO::2013-09-23 18:12:10,362::vdsm::101::vds::(run) (PID: 26042) I am the actual vdsm 4.12.0-138.gitab256be.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-419.el6.x86_64)

Please attach the sanlock log.

Regardless, the issue is that the devices marshalled to disk do not contain the 'iface' key, which is added in getConfDrives, and getConfDrives is only called when running a VM.

This means you've hotplugged a device and it doesn't contain the key.
Simply running the following would reach the same result:
1. hotplug a device
2. restart vdsm

getConfDrive should not always add 'iface' to all devices, and normalizeDrivesIndices should not assume all drives have the 'iface' key.
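
To illustrate the suggestion above, a minimal sketch (assumed names and fallback value, not vdsm's actual implementation) of index normalization that tolerates drives persisted without an 'iface' key:

# Hedged sketch, not vdsm code: tolerate drives marshalled to disk without an
# 'iface' key (e.g. hotplugged disks) instead of raising KeyError.
DEFAULT_IFACE = 'virtio'  # assumed fallback bus

def normalize_drive_indices(drives):
    used_indices = {}
    for drv in drives:
        iface = drv.get('iface', DEFAULT_IFACE)  # .get() instead of drv['iface']
        drv['iface'] = iface
        taken = used_indices.setdefault(iface, set())
        if drv.get('index') is None:
            # assign the lowest free index for this interface
            idx = 0
            while idx in taken:
                idx += 1
            drv['index'] = idx
        taken.add(int(drv['index']))
    return drives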
Comment 2 Elad 2013-09-24 08:23:01 EDT
Created attachment 802203 [details]
sanlock.log

sanlock.log attached
Comment 3 Dan Kenigsberg 2013-09-24 08:56:56 EDT
Elad, do you have the vdsm.log of the vmHotplugDisk() call?

Engine should have sent the 'iface' element there, which should have been either 'ide' or 'pci'. If not, it's an Engine bug (which can still be worked around on the vdsm side if it is impossible to fix properly in Engine).
Comment 4 Elad 2013-09-24 09:02:13 EDT
Created attachment 802216 [details]
vdsm.log (hotplug)

(In reply to Dan Kenigsberg from comment #3)
> Elad, do you have the vdsm.log of the vmHotplugDisk() call?
> 
> Engine should have sent the 'iface' element there, which should have been
> either 'ide' or 'pci'. If not, it's an Engine bug (which can still be worked
> around on the vdsm side if it is impossible to fix properly in Engine).

Thread-7360::DEBUG::2013-09-23 15:11:56,223::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.52]::call vmHotplugDisk with ({'vmId': 'afac6a2c-2210-4f5d-a827-cadb046243d1', 'drive': {'iface': 'virtio', 'format': 'raw', 'optional': 'false', 'volumeID': '3257c0a1-9fd4-4882-ab77-afe3b6b23a2a', 'imageID': '09a8bc04-7fa6-4673-8fac-35926164024e', 'readonly': 'false', 'domainID': 'eff02bb9-cea8-4f89-a077-47f36be46197', 'deviceId': '09a8bc04-7fa6-4673-8fac-35926164024e', 'poolID': 'b7cb43df-2955-47ed-b2a5-07ee6891c2b4', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}},) {} flowID [5042b295]
Comment 6 Yeela Kaplan 2013-12-03 12:19:21 EST
Hi Elad,
Can you please provide the libvirt logs from the time of the hotplug, so we can see the difference between the information reaching libvirt and the information saved in vdsm for the device?

During recovery, we obtain the VM info from libvirt. If the 'iface' attribute wasn't sent to libvirt, we can't recover correctly.

Thanks!
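
For reference, a hedged sketch (an assumed helper, not part of vdsm) of how the disk bus could be read back from the libvirt domain XML during recovery, since libvirt records it on the <target bus='...'> element:

import xml.etree.ElementTree as ET

# Assumed helper, illustration only: map each disk's target dev to its bus as
# recorded in the libvirt domain XML, e.g. {'vda': 'virtio'}.
def disk_buses_from_domain_xml(domain_xml):
    root = ET.fromstring(domain_xml)
    buses = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None:
            buses[target.get("dev")] = target.get("bus")
    return buses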
Comment 7 Ayal Baron 2013-12-08 10:44:13 EST
Hi Elad / Yeela, iiuc you are not able to reproduce this issue at all?
Comment 8 Elad 2013-12-08 11:55:03 EST
(In reply to Ayal Baron from comment #7)
> Hi Elad / Yeela, iiuc you are not able to reproduce this issue at all?

I tried to reproduce it according to the steps from comment #0, as it happened to me in the first place, and also according to Ayal's suggestion (including VM migration).
Neither seems to reproduce the issue.
Comment 9 Ayal Baron 2014-02-23 08:10:10 EST
Closing according to comment 8; please reopen if it happens again.
