Bug 1400512

Summary: VM migration fails and target host becomes non-responsive
Product: [oVirt] ovirt-engine
Component: BLL.Storage
Version: 4.1.0
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Target Milestone: ovirt-4.1.0-alpha
Target Release: ---
Reporter: Mor <mkalfon>
Assignee: Fred Rolland <frolland>
QA Contact: Raz Tamir <ratamir>
Docs Contact:
CC: bugs, danken, gklein, mkalfon, mmirecki, myakove, tnisan
Keywords: AutomationBlocker
Flags: tnisan: ovirt-4.1?
       gklein: blocker?
       mkalfon: planning_ack?
       mkalfon: devel_ack?
       mkalfon: testing_ack?
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-12-05 14:04:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments: vdsm and engine logs (flags: none)

Description Mor 2016-12-01 12:14:37 UTC
Created attachment 1226797 [details]
vdsm and engine logs

Description of problem:
VM migration with an OVN (external) network sometimes fails and leaves the migration target host non-responsive. The host recovers from this state on its own after some time.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 4.1.0-0.0.master.20161130231307.gita5ff876.el7.centos
vDSM version: vdsm-4.18.999-1019.gitd5042a5.el7.centos

How reproducible:
Sometimes.

Steps to Reproduce:
1. Create a VM with two vNICs, one attached to ovirtmgmt and the other to an OVN network.
2. Run the VM on host-0.
3. Migrate the VM from host-0 to host-1 (a scripted way to trigger the same migration is sketched below).
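
For reference, the migration in step 3 can also be triggered from a script. Below is a minimal sketch using the oVirt Python SDK v4 (ovirtsdk4); the engine URL, credentials and the VM/host names are placeholders for this environment, and the exact calls should be double-checked against the installed SDK version:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials.
connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',
    username='admin@internal',
    password='password',
    insecure=True,
)
try:
    vms_service = connection.system_service().vms_service()
    # Placeholder VM name; in this run it was golden_env_mixed_virtio_0.
    vm = vms_service.list(search='name=my_vm')[0]
    # Ask the engine to migrate the running VM to the target host (host-1).
    vms_service.vm_service(vm.id).migrate(host=types.Host(name='host-1'))
finally:
    connection.close()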

Actual results:
Migration fails and the target host becomes non-responsive.

Expected results:
Migration should succeed and the target host should remain responsive.

Additional info:
When looking at the tar.gz, please note:

1. The migration scenario started at 2016-12-01 08:23:41.
2. host-0 corresponds to vdsm-source-host.log and host-1 to vdsm-target-host.log (as described in the steps to reproduce above).

Exception in vdsm.log of host-1 (target):

2016-12-01 08:23:41,331 ERROR (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3115, in prepareImage
    self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 326, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)
2016-12-01 08:23:41,332 INFO  (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309 (task:1175)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [storage.Dispatcher] {'status': {'message': "Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)", 'code': 309}} (dispatcher:77)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') The vm start process failed (vm:613)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 549, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1899, in _run
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 639, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 320, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'poolID': '225c5c69-a991-4d3c-929f-3853ec8e38cd', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '75956224', 'specParams': {}, 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'readonly': 'False', 'shared': 'false', 'truesize': '76943360', 'type': 'disk', 'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volumeInfo': {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}, 'format': 'cow', 'deviceId': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'address': {'slot': '0x08', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/rhev/data-center/225c5c69-a991-4d3c-929f-3853ec8e38cd/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'propagateErrors': 'off', 'optional': 'false', 'name': 'vda', 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'alias': 'virtio-disk0', 'discard': False, 'volumeChain': [{'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '61b68b01-27cd-44eb-aae8-c5e2313275b7', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7'}, {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}]}
2016-12-01 08:23:41,334 INFO  (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Changed state to Down: Bad volume specification {'poolID': '225c5c69-a991-4d3c-929f-3853ec8e38cd', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '75956224', 'specParams': {}, 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'readonly': 'False', 'shared': 'false', 'truesize': '76943360', 'type': 'disk', 'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volumeInfo': {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}, 'format': 'cow', 'deviceId': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'address': {'slot': '0x08', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/rhev/data-center/225c5c69-a991-4d3c-929f-3853ec8e38cd/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'propagateErrors': 'off', 'optional': 'false', 'name': 'vda', 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'alias': 'virtio-disk0', 'discard': False, 'volumeChain': [{'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '61b68b01-27cd-44eb-aae8-c5e2313275b7', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7'}, {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}]} (code=1) (vm:1195)
2016-12-01 08:23:41,334 INFO  (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Stopping connection (guestagent:430)
2016-12-01 08:23:42,363 INFO  (jsonrpc/5) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Release VM resources (vm:4048)
2016-12-01 08:23:42,364 WARN  (jsonrpc/5) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') trying to set state to Powering down when already Down (vm:348)
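
The failure chain in this traceback: prepareVolumePath on the target calls prepareImage, which looks up the storage pool by spUUID and raises StoragePoolUnknown because this host is not connected to pool 225c5c69-a991-4d3c-929f-3853ec8e38cd; clientIF then turns that into VolumeError and the VM start fails. A simplified illustration of the pool lookup (a sketch based on the traceback, not vdsm's actual implementation):

# Simplified stand-in for the lookup that fails in hsm.getPool above;
# names and structure are illustrative only.
class StoragePoolUnknown(Exception):
    pass

class FakeHSM(object):
    def __init__(self):
        # A pool appears here only after connectStoragePool has succeeded
        # on this host.
        self.pools = {}

    def getPool(self, spUUID):
        if spUUID not in self.pools:
            # Path taken on the migration target in this bug: the engine
            # knows the pool, but this host is not connected to it.
            raise StoragePoolUnknown(spUUID)
        return self.pools[spUUID]

hsm = FakeHSM()
try:
    hsm.getPool('225c5c69-a991-4d3c-929f-3853ec8e38cd')
except StoragePoolUnknown as e:
    print('Unknown pool id, pool not connected: %s' % e)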

Exception in engine.log:

2016-12-01 08:24:29,602+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-10) [66ec901f] Correlation ID: vms_syncAction_345c416d-84b1-434d, Job ID: 33a4e087-c79d-4c3c-a815-6e689267f9d5, Call Stack: null, Custom Event ID: -1, Message: Migration failed  (VM: golden_env_mixed_virtio_0, Source: host_mixed_1, Destination: host_mixed_2).
2016-12-01 08:24:29,611+02 INFO  [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (org.ovirt.thread.pool-6-thread-10) [66ec901f] Lock freed to object 'EngineLock:{exclusiveLocks='[4ce4cb51-e754-4385-af99-d353346d01a5=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName golden_env_mixed_virtio_0>]', sharedLocks='null'}'
2016-12-01 08:24:39,083+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-9) [vms_syncAction_0f9b8764-18a4-452d] Running command: StopVmCommand internal: false. Entities affected :  ID: 772bafef-72aa-45db-b574-0ee257c51685 Type: VMAction group STOP_VM with role type USER

Comment 1 Mor 2016-12-01 13:53:45 UTC
After discussing this and thinking about possible causes, I wanted to add that this could be related to loss of network connectivity on the management network during VM migration. We have a ticket for OVS migration (OVN is based on OVS), BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1362495

Comment 2 Dan Kenigsberg 2016-12-04 12:55:14 UTC
I do not think it is at all related to the OVS issue; vdsm handles the two very differently.

Could you try OVN with the vdsm of rhev-4.0.z? I suspect that you are seeing a 4.1 regression in vdsm, as we have not touched vdsm's OVN code.

Comment 3 Tal Nisan 2016-12-04 15:47:58 UTC
This part looks a bit troubling:
2016-12-01 08:23:41,332 INFO  (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309 (task:1175)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [storage.Dispatcher] {'status': {'message': "Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)", 'code': 309}} (dispatcher:77)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') The vm start process failed (vm:613)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 549, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1899, in _run
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 639, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 320, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification 
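
For context, the storage failure above does not abort the flow by itself; prepareVolumePath gets an error status back from the storage layer and raises VolumeError for the whole drive spec, which is what kills the VM start on the target. A simplified sketch of that wrapping (illustrative only, not vdsm's actual code; the status-dict shape mirrors the dispatcher output above):

# Illustrative only: how a failed storage response becomes the
# VolumeError seen in the traceback.
class VolumeError(Exception):
    pass

def prepare_volume_path(irs, drive):
    # irs.prepareImage is assumed to return a status dict shaped like the
    # dispatcher output, e.g.
    # {'status': {'code': 309, 'message': 'Unknown pool id, ...'}}
    res = irs.prepareImage(drive['domainID'], drive['poolID'],
                           drive['imageID'], drive['volumeID'])
    if res['status']['code'] != 0:
        # Any non-zero storage status (here: pool not connected, code 309)
        # is reported as a bad volume specification for the drive.
        raise VolumeError(drive)
    return res['path']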


Fred, please have a look

Comment 4 Mor 2016-12-05 09:09:53 UTC
(In reply to Dan Kenigsberg from comment #2)
> I do not think it is at all related to the OVS issue; vdsm handles the two
> very differently.
> 
> Could you try OVN with the vdsm of rhev-4.0.z? I suspect that you are
> seeing a 4.1 regression in vdsm, as we have not touched vdsm's OVN code.

Dan, I will check whether we have a rhev 4.0.x environment to run the test on and let you know. Are the OVN provider and driver fully supported in 4.0.x? There was a bug in the REST API with openstackproviders, so I will have to check it manually.

Comment 5 Mor 2016-12-05 14:04:53 UTC

*** This bug has been marked as a duplicate of bug 1401523 ***