Bug 1666498 - nova live-migration end in error with UnexpectedTaskStateError: Conflict updating instance.
Keywords:
Status: CLOSED DUPLICATE of bug 1636280
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: OSP DFG:Compute
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-15 20:22 UTC by Siggy Sigwald
Modified: 2023-03-21 19:10 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-18 17:39:17 UTC
Target Upstream Version:
Embargoed:



Description Siggy Sigwald 2019-01-15 20:22:29 UTC
Description of problem:
Live migration ends in error with:

 fault | {"message": "Conflict updating instance 12228f0a-0b72-4139-b2e3-522f762371a1. Expected: {'task_state': [u'migrating']}. Actual: {'task_state': None}", "code": 500, "details": "  File \"/usr/lib/python2.7/site-packages/nova/compute/manager.py\", line 204, in decorated_function |

On the source compute, the following appears:

2019-01-10 13:10:45.603 2624 ERROR nova.virt.libvirt.driver [req-88db2624-2401-4427-9b3a-79293bf986c3 a1e302a33a9a4ee8aa5b1bc794217974 3e28363adf4c4376aef755571fb3e61e - - -] [instance: 12228f0a-0b72-4139-b2e3-522f762371a1] Error from libvirt during undefine. Code=42 Error=Domain not found: no domain with matching uuid '12228f0a-0b72-4139-b2e3-522f762371a1' (instance-00008bf2)
2019-01-10 13:10:45.692 2624 WARNING nova.virt.libvirt.driver [req-88db2624-2401-4427-9b3a-79293bf986c3 a1e302a33a9a4ee8aa5b1bc794217974 3e28363adf4c4376aef755571fb3e61e - - -] [instance: 12228f0a-0b72-4139-b2e3-522f762371a1] Error monitoring migration: Domain not found: no domain with matching uuid '12228f0a-0b72-4139-b2e3-522f762371a1' (instance-00008bf2)
2019-01-10 13:10:45.692 2624 ERROR nova.virt.libvirt.driver [instance: 12228f0a-0b72-4139-b2e3-522f762371a1] Traceback (most recent call last):
2019-01-10 13:10:45.692 2624 ERROR nova.virt.libvirt.driver [instance: 12228f0a-0b72-4139-b2e3-522f762371a1]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6452, in _live_migration
2019-01-10 13:10:45.692 2624 ERROR nova.virt.libvirt.driver [instance: 12228f0a-0b72-4139-b2e3-522f762371a1]     finish_event, disk_paths)
2019-01-10 13:10:45.692 2624 ERROR nova.virt.libvirt.driver [instance: 12228f0a-0b72-4139-b2e3-522f762371a1]   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 6362, in _live_migration_monitor

Comment 2 Artom Lifshitz 2019-01-17 17:30:34 UTC
We recently merged [1] for bz 1636280; it looks like it should band-aid that 'Domain not found' libvirt error. It was merged somewhat speculatively because I wasn't able to actually reproduce the race, and the customer for bz 1636280 never did get back to us about testing the build. What we do with this current bz is up to the customer: either wait for [1] to appear in a z-stream, or we can provide openstack-nova-14.1.0-36.el7ost as a sort of hotfix that the customer can test to see whether it really addresses their race.

[1] https://code.engineering.redhat.com/gerrit/#/c/154627/

Comment 4 Matthew Booth 2019-01-18 17:39:17 UTC
The sequence here is (some tasks omitted for brevity):

Libvirt sends VIR_DOMAIN_JOB_COMPLETED to monitor on source compute.
Source compute calls ComputeManager._post_live_migration

Asynchronously, libvirt logs:

2019-01-10 16:55:31.773+0000: initiating migration
2019-01-10 16:56:03.977+0000: shutting down, reason=migrated
2019-01-10T16:56:04.021523Z qemu-kvm: terminating on signal 15 from pid 2736 (/usr/sbin/libvirtd)

Note the timestamp when qemu shuts down.

Source compute casts post_live_migration_at_destination on the destination (a cast, so it runs asynchronously).

2019-01-10 16:56:07.922: Source compute attempts to delete the domain, which fails because libvirt had already shut the domain down a few seconds earlier (at 16:56:04).

Source compute puts the instance and migration into an error state, which clears the instance's task state.
Destination compute, running post_live_migration_at_destination, then fails to save the instance object (and therefore the new instance.host) because the task state was cleared by the source compute.
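
To make that last step concrete, here is a minimal, self-contained sketch (not nova's actual code) of the compare-and-set pattern behind saving an instance with an expected task state. The in-memory db dict and the save_instance() helper are hypothetical stand-ins used only to illustrate why the destination's save raises the conflict seen in the fault field above.

# Minimal sketch (not nova's actual code) of a conditional save with an
# expected task_state; db and save_instance() are hypothetical stand-ins.

class UnexpectedTaskStateError(Exception):
    pass


UUID = '12228f0a-0b72-4139-b2e3-522f762371a1'

# Stand-in for the instance record as it looks mid-migration.
db = {UUID: {'task_state': 'migrating', 'host': 'source-compute'}}


def save_instance(uuid, updates, expected_task_state):
    """Apply updates only if the stored task_state matches expectations."""
    row = db[uuid]
    if row['task_state'] not in expected_task_state:
        raise UnexpectedTaskStateError(
            "Conflict updating instance %s. Expected: {'task_state': %r}. "
            "Actual: {'task_state': %r}"
            % (uuid, expected_task_state, row['task_state']))
    row.update(updates)


# Source compute hits the cleanup failure, errors out the migration and
# clears the task state.
db[UUID]['task_state'] = None

# Destination compute then tries to record the new host, still expecting
# 'migrating' -- this raises, mirroring the fault in the description.
try:
    save_instance(UUID, {'host': 'dest-compute'},
                  expected_task_state=['migrating'])
except UnexpectedTaskStateError as exc:
    print(exc)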

================

There are at least 2 races here:

1. Nova races with libvirt to destroy the instance post migration after receiving VIR_DOMAIN_JOB_COMPLETED.
2. In response to 1, nova source races with nova dest by concurrently modifying the instance object. This leaves the instance running on the dest, but with the instance record pointing to the source.

Apparently Nova normally wins the first race, or this would never work. Upstream worked around this issue with change I23ed9819061bfa436b12180110666c5b8c3e0f70, which causes nova not to raise an error during undefine if the domain isn't present, which probably makes sense anyway:

https://review.openstack.org/430400
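
For illustration, this is the kind of guard that change adds (a sketch, not the actual patch), using the libvirt Python bindings; undefine_if_present() is a hypothetical helper name. Note that error code 42 is VIR_ERR_NO_DOMAIN, matching the "Code=42" in the source compute's log above.

# Sketch only: treat "domain not found" as success when undefining after
# a completed live migration, since libvirt may already have removed it.

import libvirt


def undefine_if_present(conn, instance_uuid):
    """Undefine the domain, ignoring VIR_ERR_NO_DOMAIN (code 42)."""
    try:
        dom = conn.lookupByUUIDString(instance_uuid)
        dom.undefine()
    except libvirt.libvirtError as exc:
        if exc.get_error_code() == libvirt.VIR_ERR_NO_DOMAIN:
            # The domain is already gone, which is what we wanted.
            return
        raise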

Unfortunately, in an effort to fix this bug, upstream nova *introduced* the second race in change Idfdce9e7dd8106af01db0358ada15737cb846395:

https://review.openstack.org/430404

This problem is still present upstream, although various failure modes which might trigger it have been worked around separately.

================

For this specific issue, the customer only needs a fix for the initial problem. If the cleanup error hadn't occurred, none of the other problems would have subsequently occurred either. This is already in progress in bug 1636280.

There are some other fixes which would make this process more robust to failure, though:

1. The source compute could update instance.host to the destination compute immediately after the migration completes, and before doing any cleanup. That way, in the event of a failure, nova has a record of where the instance is actually running.

2. The source compute should not touch the instance object under any circumstances after calling post_live_migration_at_destination(). It can still set the migration to an error state if required.

3. The source compute could join the _live_migration thread before running cleanup on the source. This way it could be sure that the instance is no longer running, and it would not be necessary to clean up the running domain at all (a rough sketch of this ordering follows below).
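
A rough, runnable sketch of that ordering; every name below is a hypothetical placeholder (not a nova API), and the point is only the order of operations relative to the failure described in this comment.

# Sketch of the suggested ordering; all helpers are placeholders.

import threading


def update_instance_host(instance, host):
    # Placeholder for persisting instance.host.
    instance['host'] = host


def notify_destination(instance, host):
    # Placeholder for casting post_live_migration_at_destination.
    print('destination %s finishing migration of %s' % (host, instance['uuid']))


def cleanup_source(instance):
    # Placeholder for source-side cleanup (undefine domain, unplug VIFs, ...).
    print('source cleanup for %s' % instance['uuid'])


def finish_live_migration(instance, migration_thread, dest_host):
    # (3) Join the migration monitor thread first, so the source knows the
    # guest has really stopped before any cleanup runs.
    migration_thread.join()

    # (1) Record where the instance is actually running *before* cleanup,
    # so a later failure can't leave the record pointing at the source.
    update_instance_host(instance, dest_host)

    # (2) Hand off to the destination; after this point the source should
    # not touch the instance object again (migration record only).
    notify_destination(instance, dest_host)

    cleanup_source(instance)


instance = {'uuid': '12228f0a-0b72-4139-b2e3-522f762371a1', 'host': 'source'}
t = threading.Thread(target=lambda: None)  # stand-in for _live_migration
t.start()
finish_live_migration(instance, t, 'dest-compute')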

*** This bug has been marked as a duplicate of bug 1636280 ***

