Bug 1636102 - Update instance host and task state when post live migration fails
Summary: Update instance host and task state when post live migration fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z10
Target Release: 10.0 (Newton)
Assignee: Artom Lifshitz
QA Contact: OSP DFG:Compute
URL:
Whiteboard:
Duplicates: 1289858 1630771
Depends On:
Blocks: 1269615
Reported: 2018-10-04 13:10 UTC by Pablo Caruana
Modified: 2023-03-21 19:00 UTC

Fixed In Version: openstack-nova-14.1.0-35.el7ost
Doc Type: Bug Fix
Doc Text:
This update fixes issues that could leave an instance record in an inconsistent state after failure of a volume API (Cinder) call during a live migration. Prior to this update, if the volume API call failed, the live migration didn't finish correctly and the instance record was left in an inconsistent state in the database. With this update, the instance record is updated correctly in such cases. The volume API error is logged. Administrators may need to clean up the instance's volume attachments.
Clone Of:
Environment:
Last Closed: 2019-01-16 17:09:01 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1628606 | 0 | None | None | None | 2018-10-04 13:10:19 UTC
OpenStack gerrit | 609517 | 0 | None | MERGED | Handle volume API failure in _post_live_migration | 2021-01-08 02:46:04 UTC
Red Hat Issue Tracker | OSP-11661 | 0 | None | None | None | 2021-12-10 17:47:31 UTC
Red Hat Knowledge Base (Solution) | 2070503 | 0 | None | None | None | 2018-10-10 19:13:47 UTC
Red Hat Knowledge Base (Solution) | 3676151 | 0 | None | None | None | 2018-11-02 08:20:56 UTC
Red Hat Product Errata | RHBA-2019:0074 | 0 | None | None | None | 2019-01-16 17:09:13 UTC

Description Pablo Caruana 2018-10-04 13:10:19 UTC
Update instance host and task state when post live migration fails

If a live migration fails during post-processing, it can lead to
the instance being shut down on the source node and left in a migrating
task state. The instance is now running on the target node, so the
instance host and task state should be updated.

Comment 7 Artom Lifshitz 2018-10-09 23:47:43 UTC
I talked with Joachim on IRC this morning - we mentioned coming up with a workaround, since the upstream fix might be touchy and take a while.

Having looked over the sosreports and code, the only workaround I can come up with is manually updating instance.host in the database once a situation like this has been detected. It sucks, but once the system is in an inconsistent state, I don't see another way of fixing it.
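
As a rough illustration of the manual workaround above, the sketch below runs the kind of UPDATE involved against an in-memory SQLite stand-in, not a real Nova database. The three-column `instances` table, the `repoint_instance` helper, and the UUID and host names are all hypothetical; the real table has more columns that are not modelled here. Clearing `task_state` alongside `host` is an assumption drawn from the bug title rather than from this comment.

```python
import sqlite3


def repoint_instance(conn, instance_uuid, dest_host):
    """Point one stuck instance at dest_host and clear its task state.

    Hypothetical helper: only rows still marked 'migrating' are touched,
    so running it twice (or on a healthy instance) changes nothing.
    Returns the number of rows updated.
    """
    cur = conn.execute(
        "UPDATE instances SET host = ?, task_state = NULL "
        "WHERE uuid = ? AND task_state = 'migrating'",
        (dest_host, instance_uuid),
    )
    conn.commit()
    return cur.rowcount


# Simplified stand-in schema and data (assumptions, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (uuid TEXT PRIMARY KEY, host TEXT, task_state TEXT)"
)
conn.execute(
    "INSERT INTO instances VALUES ('hypothetical-uuid', 'source-host', 'migrating')"
)
conn.commit()
```

Calling `repoint_instance(conn, 'hypothetical-uuid', 'dest-host')` updates the one stuck row. Anything along these lines on a production database would need a backup first, and it does nothing about the VIFs and port bindings mentioned below.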

Also, be aware that because post_live_migration failed on the source, some other things didn't get done:
* VIFs on the source weren't unplugged.
* Port bindings weren't updated to reflect the instance being on the destination.

There are a bunch of other bugs that I believe are either identical or similar enough that it's worth it to analyse them all before coming up with a fix. That's my next step, and then I'll post a patch.

Comment 8 Artom Lifshitz 2018-10-10 14:21:46 UTC
Since Joachim asked on IRC, here's the list of BZs that I think are related/duplicates:

* bz 1289858
* bz 1630771
* bz 1636280

Comment 9 Artom Lifshitz 2018-10-10 19:07:56 UTC
Bz 1289858 and bz 1630771 are indeed identical, and are both caused by failures when calling Cinder - either Cinder itself, or something "in front" of Cinder, like Keystone or HAProxy.

I've proposed [1] upstream. It's really "dumb", but by virtue of being "dumb" it's also simple and minimizes potential side effects. Since the single common root cause in all 3 bugs is the external API call, I've just wrapped it in a try/except. Let's see what the community thinks.

[1] https://review.openstack.org/609517

PS: Bz 1636280 is similar but unrelated, and will need a different fix.
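
The try/except approach described above can be sketched roughly as follows. This is a minimal, self-contained illustration of the pattern, not the code from the upstream review: `Instance`, `FailingVolumeAPI`, `VolumeAPIError`, and this `post_live_migration` function are hypothetical stand-ins, and the real Nova method does considerably more.

```python
import logging
from dataclasses import dataclass
from typing import List, Optional

LOG = logging.getLogger(__name__)


@dataclass
class Instance:
    """Hypothetical, heavily simplified instance record."""
    uuid: str
    host: str
    task_state: Optional[str] = None


class VolumeAPIError(Exception):
    """Stand-in for any failure while calling the volume API."""


class FailingVolumeAPI:
    """Stand-in volume API whose calls always fail."""

    def terminate_connection(self, volume_id: str) -> None:
        raise VolumeAPIError("volume API unreachable")


def post_live_migration(instance: Instance, dest_host: str,
                        volume_ids: List[str], volume_api) -> List[str]:
    """Finish a migration even if the external volume API call fails.

    Returns the volume IDs whose source connections could not be
    terminated, so that they can be cleaned up later.
    """
    failed = []
    for volume_id in volume_ids:
        try:
            volume_api.terminate_connection(volume_id)
        except Exception as exc:
            # Log and carry on: the guest is already running on the
            # destination, so the instance record must still be updated.
            LOG.error("Could not terminate connection for volume %s: %s",
                      volume_id, exc)
            failed.append(volume_id)

    # Reached whether or not the volume API calls succeeded.
    instance.host = dest_host
    instance.task_state = None
    return failed
```

With `FailingVolumeAPI`, every volume ID comes back in the returned list, but the instance still ends up on the destination host with its task state cleared. The trade-off is that the failed volume connections are only logged and may need manual cleanup afterwards.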

Comment 10 Artom Lifshitz 2018-10-10 19:13:47 UTC
*** Bug 1630771 has been marked as a duplicate of this bug. ***

Comment 11 Artom Lifshitz 2018-10-10 19:14:39 UTC
*** Bug 1289858 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2019-01-16 17:09:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0074

