| Summary: | VM migration fails and target host becomes non-responsive | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Mor <mkalfon> |
| Component: | BLL.Storage | Assignee: | Fred Rolland <frolland> |
| Status: | CLOSED DUPLICATE | QA Contact: | Raz Tamir <ratamir> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.1.0 | CC: | bugs, danken, gklein, mkalfon, mmirecki, myakove, tnisan |
| Target Milestone: | ovirt-4.1.0-alpha | Keywords: | AutomationBlocker |
| Target Release: | --- | Flags: | tnisan: ovirt-4.1?, gklein: blocker?, mkalfon: planning_ack?, mkalfon: devel_ack?, mkalfon: testing_ack? |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-12-05 14:04:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | vdsm and engine logs (attachment 1226797) | | |
After discussing this and thinking about possible causes, I wanted to add that this could be related to a loss of network connectivity on the management network during VM migration. We have a ticket for OVS (OVN is based on OVS) migration, BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1362495

I do not think it is at all related to the OVS issue; vdsm handles the two very differently. Could you try OVN with the vdsm of rhev-4.0.z? I suspect that you are seeing a 4.1 regression in vdsm, since we have not touched vdsm for OVN. This part looks a bit troubling:
2016-12-01 08:23:41,332 INFO (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309 (task:1175)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [storage.Dispatcher] {'status': {'message': "Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)", 'code': 309}} (dispatcher:77)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') The vm start process failed (vm:613)
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 549, in _startUnderlyingVm
self._run()
File "/usr/share/vdsm/virt/vm.py", line 1899, in _run
self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
File "/usr/share/vdsm/virt/vm.py", line 639, in _preparePathsForDrives
drive['path'] = self.cif.prepareVolumePath(drive, self.id)
File "/usr/share/vdsm/clientIF.py", line 320, in prepareVolumePath
raise vm.VolumeError(drive)
VolumeError: Bad volume specification
Fred, please have a look
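For reference, the traceback above corresponds to a short call chain in vdsm: _startUnderlyingVm → _preparePathsForDrives → prepareVolumePath → prepareImage → getPool. A minimal sketch of that chain follows; class and method bodies are simplified and illustrative, not the actual vdsm code, and it assumes the host's pool map is only populated once connectStoragePool succeeds:

```python
# Minimal sketch of the failing call chain from the traceback above.
# Simplified from vdsm's vm.py / clientIF.py / hsm.py; names and
# signatures are illustrative, not the real vdsm implementation.

class StoragePoolUnknown(Exception):
    """Raised when a pool UUID is not in the host's connected-pool map."""

class VolumeError(Exception):
    """Raised when a drive's volume path cannot be prepared."""

class HSM(object):
    def __init__(self):
        # Populated only after connectStoragePool() succeeds on this host.
        self.pools = {}

    def getPool(self, spUUID):
        if spUUID not in self.pools:
            # Surfaces as 'Unknown pool id, pool not connected' - code 309.
            raise StoragePoolUnknown(spUUID)
        return self.pools[spUUID]

    def prepareImage(self, spUUID, sdUUID, imgUUID, volUUID):
        self.getPool(spUUID)  # fails here if the pool is not connected
        # ... activate the volume and return its path ...

class ClientIF(object):
    def __init__(self, hsm):
        self.hsm = hsm

    def prepareVolumePath(self, drive, vmId):
        try:
            return self.hsm.prepareImage(drive['poolID'], drive['domainID'],
                                         drive['imageID'], drive['volumeID'])
        except StoragePoolUnknown:
            # Re-raised in vm.py as 'VolumeError: Bad volume specification',
            # which then marks the incoming VM as Down.
            raise VolumeError(drive)
```

In other words, on an incoming migration the destination host must already have the storage pool connected before it prepares the drive paths; the code 309 error suggests that precondition did not hold on the target at that moment.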
(In reply to Dan Kenigsberg from comment #2)
> I do not think it is at all related to the OVS issue; vdsm handles the two
> very differently.
>
> Could you try OVN with the vdsm of rhev-4.0.z? I suspect that you are seeing
> a 4.1 regression in vdsm, since we have not touched vdsm for OVN.

Dan, I will check whether we have a rhev 4.0.x environment to run the test on and let you know. Are the OVN provider and driver fully supported in 4.0.x? There was a bug in REST with openstackproviders, so I will have to check it manually.

*** This bug has been marked as a duplicate of bug 1401523 ***
Created attachment 1226797 [details]
vdsm and engine logs

Description of problem:
VM migration with an OVN network (external network) sometimes fails and makes the target migration host non-responsive. The host auto-recovers from this state after some time.

Version-Release number of selected component (if applicable):
oVirt Engine version: 4.1.0-0.0.master.20161130231307.gita5ff876.el7.centos
vdsm version: vdsm-4.18.999-1019.gitd5042a5.el7.centos

How reproducible:
Sometimes.

Steps to Reproduce:
1. Create a VM with two vNICs, one attached to ovirtmgmt and the other to an OVN network.
2. Run the VM on host-0.
3. Migrate the VM from host-0 to host-1.

Actual results:
Migration fails and the target host becomes non-responsive.

Expected results:
Migration should work.

Additional info:
When looking at the tar.gz, please note:
1. The migration scenario started at 2016-12-01 08:23:41.
2. host-0 is vdsm-source-host.log and host-1 is vdsm-target-host.log (as described in the steps to reproduce above).

Exception in vdsm.log of host-1 (target):

2016-12-01 08:23:41,331 ERROR (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') Unexpected error (task:870)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 877, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python2.7/site-packages/vdsm/logUtils.py", line 50, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 3115, in prepareImage
    self.getPool(spUUID)
  File "/usr/share/vdsm/storage/hsm.py", line 326, in getPool
    raise se.StoragePoolUnknown(spUUID)
StoragePoolUnknown: Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)
2016-12-01 08:23:41,332 INFO (vm/4ce4cb51) [storage.TaskManager.Task] (Task='c313d245-6759-48e3-85a8-b1fea1083762') aborting: Task is aborted: 'Unknown pool id, pool not connected' - code 309 (task:1175)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [storage.Dispatcher] {'status': {'message': "Unknown pool id, pool not connected: ('225c5c69-a991-4d3c-929f-3853ec8e38cd',)", 'code': 309}} (dispatcher:77)
2016-12-01 08:23:41,332 ERROR (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') The vm start process failed (vm:613)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 549, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1899, in _run
    self._preparePathsForDrives(dev_spec_map[hwclass.DISK])
  File "/usr/share/vdsm/virt/vm.py", line 639, in _preparePathsForDrives
    drive['path'] = self.cif.prepareVolumePath(drive, self.id)
  File "/usr/share/vdsm/clientIF.py", line 320, in prepareVolumePath
    raise vm.VolumeError(drive)
VolumeError: Bad volume specification {'poolID': '225c5c69-a991-4d3c-929f-3853ec8e38cd', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '75956224', 'specParams': {}, 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'readonly': 'False', 'shared': 'false', 'truesize': '76943360', 'type': 'disk', 'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volumeInfo': {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}, 'format': 'cow', 'deviceId': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'address': {'slot': '0x08', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/rhev/data-center/225c5c69-a991-4d3c-929f-3853ec8e38cd/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'propagateErrors': 'off', 'optional': 'false', 'name': 'vda', 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'alias': 'virtio-disk0', 'discard': False, 'volumeChain': [{'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '61b68b01-27cd-44eb-aae8-c5e2313275b7', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7'}, {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}]}
2016-12-01 08:23:41,334 INFO (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Changed state to Down: Bad volume specification {'poolID': '225c5c69-a991-4d3c-929f-3853ec8e38cd', 'reqsize': '0', 'index': '0', 'iface': 'virtio', 'apparentsize': '75956224', 'specParams': {}, 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'readonly': 'False', 'shared': 'false', 'truesize': '76943360', 'type': 'disk', 'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volumeInfo': {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}, 'format': 'cow', 'deviceId': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'address': {'slot': '0x08', 'bus': '0x00', 'domain': '0x0000', 'type': 'pci', 'function': '0x0'}, 'device': 'disk', 'path': '/rhev/data-center/225c5c69-a991-4d3c-929f-3853ec8e38cd/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'propagateErrors': 'off', 'optional': 'false', 'name': 'vda', 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'alias': 'virtio-disk0', 'discard': False, 'volumeChain': [{'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': '61b68b01-27cd-44eb-aae8-c5e2313275b7', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/61b68b01-27cd-44eb-aae8-c5e2313275b7'}, {'domainID': '679a96d3-bd26-4e3e-ba16-7992811bf566', 'volType': 'path', 'leaseOffset': 0, 'volumeID': 'be2c0748-2108-47c2-9ab2-2bacb60c9aeb', 'leasePath': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb.lease', 'imageID': '0ed9b57a-5f09-44cc-8721-007201443b8b', 'path': '/rhev/data-center/mnt/10.35.110.11:_RHV__NFS_Network__ge_network__ge__2__nfs__0/679a96d3-bd26-4e3e-ba16-7992811bf566/images/0ed9b57a-5f09-44cc-8721-007201443b8b/be2c0748-2108-47c2-9ab2-2bacb60c9aeb'}]} (code=1) (vm:1195)
2016-12-01 08:23:41,334 INFO (vm/4ce4cb51) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Stopping connection (guestagent:430)
2016-12-01 08:23:42,363 INFO (jsonrpc/5) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') Release VM resources (vm:4048)
2016-12-01 08:23:42,364 WARN (jsonrpc/5) [virt.vm] (vmId='4ce4cb51-e754-4385-af99-d353346d01a5') trying to set state to Powering down when already Down (vm:348)

Exception in engine.log:

2016-12-01 08:24:29,602+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-10) [66ec901f] Correlation ID: vms_syncAction_345c416d-84b1-434d, Job ID: 33a4e087-c79d-4c3c-a815-6e689267f9d5, Call Stack: null, Custom Event ID: -1, Message: Migration failed (VM: golden_env_mixed_virtio_0, Source: host_mixed_1, Destination: host_mixed_2).
2016-12-01 08:24:29,611+02 INFO [org.ovirt.engine.core.bll.MigrateVmToServerCommand] (org.ovirt.thread.pool-6-thread-10) [66ec901f] Lock freed to object 'EngineLock:{exclusiveLocks='[4ce4cb51-e754-4385-af99-d353346d01a5=<VM, ACTION_TYPE_FAILED_VM_IS_BEING_MIGRATED$VmName golden_env_mixed_virtio_0>]', sharedLocks='null'}'
2016-12-01 08:24:39,083+02 INFO [org.ovirt.engine.core.bll.StopVmCommand] (default task-9) [vms_syncAction_0f9b8764-18a4-452d] Running command: StopVmCommand internal: false. Entities affected : ID: 772bafef-72aa-45db-b574-0ee257c51685 Type: VMAction group STOP_VM with role type USER
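Since the failure is intermittent, ordering of events in the target's vdsm.log matters. Below is a small, hypothetical log-scan helper (not part of vdsm or oVirt; the pool UUID and the default log file name are taken from the logs above) that checks whether a prepareImage call was attempted before any connectStoragePool for this pool:

```python
#!/usr/bin/env python
# Hypothetical log-scan helper -- NOT part of vdsm or oVirt. It checks,
# in a target host's vdsm.log, whether prepareImage was attempted before
# any connectStoragePool for the pool UUID seen in this report.

import re
import sys

# Pool UUID taken from the logs in this report.
POOL_ID = '225c5c69-a991-4d3c-929f-3853ec8e38cd'

def first_match(path, pattern):
    """Return the 1-based line number of the first line matching pattern,
    or None if the pattern never occurs."""
    rx = re.compile(pattern)
    with open(path) as log:
        for n, line in enumerate(log, 1):
            if rx.search(line):
                return n
    return None

if __name__ == '__main__':
    log_path = sys.argv[1] if len(sys.argv) > 1 else 'vdsm-target-host.log'
    connect_n = first_match(log_path, r'connectStoragePool.*' + POOL_ID)
    prepare_n = first_match(log_path, r'prepareImage')
    if prepare_n is not None and (connect_n is None or prepare_n < connect_n):
        print('prepareImage was attempted before the pool was connected.')
    else:
        print('The pool was connected before prepareImage; look elsewhere.')
```

If the first branch fires on a failed run, that would support the reading that the incoming VM's drive preparation raced ahead of the target host's pool connection rather than a storage-side fault.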