Bug 1011472 - [vdsm] cannot recover VM upon vdsm restart after a disk has been hot plugged to it
Status: CLOSED WORKSFORME
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Unspecified
Priority: high, Severity: medium
Target Milestone: ---
Target Release: 3.4.0
Assigned To: Yeela Kaplan
QA Contact: Aharon Canan
Whiteboard: storage
Depends On:
Blocks: 1019461 rhev3.4beta 1142926
 
Reported: 2013-09-24 07:17 EDT by Elad
Modified: 2016-02-10 15:16 EST
CC List: 10 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-02-23 08:10:10 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
abaron: Triaged+


Attachments
logs (1.90 MB, application/x-gzip), 2013-09-24 07:17 EDT, Elad
sanlock.log (103.32 KB, text/plain), 2013-09-24 08:23 EDT, Elad
vdsm.log (hotplug) (870.30 KB, application/x-xz), 2013-09-24 09:02 EDT, Elad

Description Elad 2013-09-24 07:17:21 EDT
Created attachment 802186 [details]
logs

Description of problem:
Cannot run a VM from the 'paused' state after connectivity with the storage has resumed.

Version-Release number of selected component (if applicable):
vdsm-4.12.0-138.gitab256be.el6ev.x86_64

How reproducible:
unknown

Steps to Reproduce:
1. Have an iSCSI data center with 2 storage domains created on 2 different storage servers.
2. Run a VM with a disk located on the non-master storage domain.
3. Block connectivity from the host to the non-master storage domain using iptables (a sketch follows these steps).
4. When the VM enters the 'paused' state, resume connectivity to the storage.
5. When the host is active again, try to activate the VM.
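
To make step 3 concrete, here is a minimal Python sketch of blocking and restoring the iSCSI traffic; the storage address and the default iSCSI port (3260) are assumptions, not values taken from this bug:

# Illustrative helper for step 3 -- the storage IP is hypothetical and the
# default iSCSI port is assumed; adjust both for a real environment.
import subprocess

STORAGE_IP = "192.0.2.10"   # hypothetical non-master storage server
ISCSI_PORT = "3260"         # default iSCSI target port

def block_storage():
    # Drop outgoing traffic from the host to the iSCSI target (step 3).
    subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", STORAGE_IP,
                           "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"])

def resume_storage():
    # Delete the blocking rule to restore connectivity (step 4).
    subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", STORAGE_IP,
                           "-p", "tcp", "--dport", ISCSI_PORT, "-j", "DROP"])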

Actual results:
Cannot start the VM from the paused state. vdsm fails with:

clientIFinit::ERROR::2013-09-23 18:12:26,480::clientIF::465::vds::(_recoverExistingVms) Vm afac6a2c-2210-4f5d-a827-cadb046243d1 recovery failed
Traceback (most recent call last):
  File "/usr/share/vdsm/clientIF.py", line 462, in _recoverExistingVms
    vmObj.getConfDevices()[vm.DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 1873, in getConfDevices
    self.normalizeDrivesIndices(devices[DISK_DEVICES])
  File "/usr/share/vdsm/vm.py", line 2058, in normalizeDrivesIndices
    if drv['iface'] not in self._usedIndices:
KeyError: 'iface'


Thread-386::ERROR::2013-09-23 18:27:07,690::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 227, in vmCont
    return vm.cont()
  File "/usr/share/vdsm/API.py", line 145, in cont
    return v.cont()
  File "/usr/share/vdsm/vm.py", line 2396, in cont
    self._underlyingCont()
  File "/usr/share/vdsm/vm.py", line 3440, in _underlyingCont
    hooks.before_vm_cont(self._dom.XMLDesc(0), self.conf)
AttributeError: 'NoneType' object has no attribute 'XMLDesc'
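
A note for context: the second traceback follows directly from the failed recovery above. Because _recoverExistingVms raised before a libvirt domain handle was attached, self._dom is still None when cont() reaches XMLDesc(). A minimal sketch of that situation, using assumed names rather than vdsm's actual classes:

# Assumed names, illustration only: why cont() hits AttributeError when
# recovery never attached a libvirt domain handle to the VM object.
class RecoveredVm(object):
    def __init__(self, libvirt_dom=None):
        # stays None when recovery fails, as in the traceback above
        self._dom = libvirt_dom

    def cont(self):
        if self._dom is None:
            # a guard like this would fail early with a clearer message
            raise RuntimeError("no underlying libvirt domain; VM was not recovered")
        return self._dom.XMLDesc(0)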


Not sure whether it's a storage or a network issue.


Additional info: logs
Comment 1 Ayal Baron 2013-09-24 08:11:34 EDT
You failed to mention that vdsm restarted:

MainThread::INFO::2013-09-23 18:12:10,362::vdsm::101::vds::(run) (PID: 26042) I am the actual vdsm 4.12.0-138.gitab256be.el6ev nott-vds2.qa.lab.tlv.redhat.com (2.6.32-419.el6.x86_64)

Please attach the sanlock log.

Regardless, the issue is that the devices marshalled to disk do not contain the 'iface' key, which is added in getConfDrives, and getConfDrives is only called when running a VM.

This means you've hotplugged a device and it doesn't contain the key.
Simply running the following would reach the same result:
1. hotplug a device
2. restart vdsm

getConfDrive should not always add 'iface' to all devices, and normalizeDrivesIndices should not assume all drives have the 'iface' key.
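
To illustrate the suggestion above, a minimal sketch (assumed names and fallback value, not vdsm's actual implementation) of index normalization that tolerates drives persisted without an 'iface' key:

# Hedged sketch, not vdsm code: tolerate drives marshalled to disk without an
# 'iface' key (e.g. hotplugged disks) instead of raising KeyError.
DEFAULT_IFACE = 'virtio'  # assumed fallback bus

def normalize_drive_indices(drives):
    used_indices = {}
    for drv in drives:
        iface = drv.get('iface', DEFAULT_IFACE)  # .get() instead of drv['iface']
        drv['iface'] = iface
        taken = used_indices.setdefault(iface, set())
        if drv.get('index') is None:
            # assign the lowest free index for this interface
            idx = 0
            while idx in taken:
                idx += 1
            drv['index'] = idx
        taken.add(int(drv['index']))
    return drives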
Comment 2 Elad 2013-09-24 08:23:01 EDT
Created attachment 802203 [details]
sanlock.log

sanlock.log attached
Comment 3 Dan Kenigsberg 2013-09-24 08:56:56 EDT
Elad, do you have the vdsm.log of the vmHotplugDisk() call?

Engine should have sent the 'iface' element there, which should have been either 'ide' or 'pci'. If not, it's an Engine bug (which can still be worked around on the vdsm side if it is impossible to fix properly in Engine).
Comment 4 Elad 2013-09-24 09:02:13 EDT
Created attachment 802216 [details]
vdsm.log (hotplug)

(In reply to Dan Kenigsberg from comment #3)
> Elad, do you have the vdsm.log of the vmHotplugDisk() call?
> 
> Engine should have sent the 'iface' element there, which should have been
> either 'ide' or 'pci'. If not, it's an Engine bug (which can still be worked
> around on the vdsm side if it is impossible to fix properly in Engine).

Thread-7360::DEBUG::2013-09-23 15:11:56,223::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.52]::call vmHotplugDisk with ({'vmId': 'afac6a2c-2210-4f5d-a827-cadb046243d1', 'drive': {'iface': 'virtio', 'format': 'raw', 'optional': 'false', 'volumeID': '3257c0a1-9fd4-4882-ab77-afe3b6b23a2a', 'imageID': '09a8bc04-7fa6-4673-8fac-35926164024e', 'readonly': 'false', 'domainID': 'eff02bb9-cea8-4f89-a077-47f36be46197', 'deviceId': '09a8bc04-7fa6-4673-8fac-35926164024e', 'poolID': 'b7cb43df-2955-47ed-b2a5-07ee6891c2b4', 'device': 'disk', 'shared': 'false', 'propagateErrors': 'off', 'type': 'disk'}},) {} flowID [5042b295]
Comment 6 Yeela Kaplan 2013-12-03 12:19:21 EST
Hi Elad,
Can you please provide the libvirt logs from the time of the hotplug, so we can see the difference between the information reaching libvirt and the information saved in vdsm for the device?

During recovery, we obtain the VM info from libvirt. If the 'iface' attribute wasn't sent to libvirt, we can't recover correctly.

Thanks!
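
For reference, a hedged sketch (an assumed helper, not part of vdsm) of how the disk bus could be read back from the libvirt domain XML during recovery, since libvirt records it on the <target bus='...'> element:

import xml.etree.ElementTree as ET

# Assumed helper, illustration only: map each disk's target dev to its bus as
# recorded in the libvirt domain XML, e.g. {'vda': 'virtio'}.
def disk_buses_from_domain_xml(domain_xml):
    root = ET.fromstring(domain_xml)
    buses = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is not None:
            buses[target.get("dev")] = target.get("bus")
    return buses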
Comment 7 Ayal Baron 2013-12-08 10:44:13 EST
Hi Elad / Yeela, iiuc you are not able to reproduce this issue at all?
Comment 8 Elad 2013-12-08 11:55:03 EST
(In reply to Ayal Baron from comment #7)
> Hi Elad / Yeela, iiuc you are not able to reproduce this issue at all?

I tried to reproduce it according to the steps from comment #0, as it happened to me in the first place, and also according to Ayal's suggestion (including VM migration).
Neither seems to reproduce the issue.
Comment 9 Ayal Baron 2014-02-23 08:10:10 EST
Closing according to comment 8; please reopen if it happens again.
