Bug 1041160 - [RFE][nova]: Snapshot state consistency between glance and nova
Summary: [RFE][nova]: Snapshot state consistency between glance and nova
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: RFEs
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Ami Jeain
URL: https://blueprints.launchpad.net/nova...
Whiteboard: upstream_milestone_none upstream_stat...
Depends On:
Blocks:
Reported: 2013-12-12 13:49 UTC by RHOS Integration
Modified: 2019-09-09 13:43 UTC

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-03-19 17:07:07 UTC
Target Upstream Version:


Description RHOS Integration 2013-12-12 13:49:12 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks.

Description:

To make progress on the summit discussion at https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency, we propose to begin moving the snapshot code that currently lives in the nova-compute manager to the conductor, where it can be executed on behalf of the nova-compute that has the VM to be snapshotted (and uploaded to glance). Running the workflow there means the snapshot can be reliably recovered (or resumed), so that the snapshotted VM ends up in an agreed-upon, well-defined state (not ERROR or IMAGE-UPLOADING). This will help avoid the state inconsistency that occurs when an upload is only partially completed because of a service outage (or other network partition), and will make the interaction between glance and nova a reliable one.

This will likely involve the following steps:

0. Document and understand the current workflow and its deficiencies.

1. Move the orchestration of the snapshot workflow to the conductor (reducing what remains in nova-compute to a smaller set of operations). Handle, in the conductor, both the existing error states of that workflow and the new ones that result from the modified workflow.

2. Once the basic workflow is running in the conductor, support detection of stalled or errored-out snapshot uploads into glance by having the nova<->glance interaction go through a better-defined workflow state machine. For the snapshot happy path, this will likely involve going through the states [LOCAL_SNAPSHOT_STARTED, LOCAL_SNAPSHOT_COMPLETE, UPLOAD_BEGIN, UPLOAD_%s_COMPLETE, UPLOAD_COMPLETE, IMAGE_ACTIVE].

2a. For the error path, there will need to be a mechanism, queryable via glance or nova, that tells the user of the snapshot process which stage of the snapshotting process nova is in. If the conductor processing the workflow has stalled, it would be nice for glance to know this via some type of 'last state change' timestamp (which is also useful for letting the user know when the last state change occurred). If the conductor has not stalled, then 'UPLOAD_%s_COMPLETE' (this may be a new state or a status of an existing state, or something else entirely), which should carry the percentage of the upload that has completed, will be useful for showing clients that the upload is not yet complete.
2aa. In general, this whole 'liveness' detection would be better handled by some type of 'shared' liveness storage system (for example, a shared, agreed-upon path in zookeeper that can be used to know from glance whether nova has died during upload), but a percent complete (and associated last state change timestamp?) as well as the request connection/socket itself is a good start.

3. Support running the above snapshot workflow in the conductor via taskflow, which brings resumption, recovery, state tracking and various other benefits.
  - https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks-taskflow
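
To make steps 2 and 2a more concrete, the happy-path states and the 'last state change' timestamp could be modelled roughly as below. This is only an illustrative sketch in plain Python, not nova, glance or taskflow code: the names SnapshotState, SnapshotTracker, UPLOAD_IN_PROGRESS (standing in for UPLOAD_%s_COMPLETE), label(), is_stalled() and remaining() are all assumptions made for the example, and the blueprint itself leaves open whether the percentage is a new state or a status of an existing one.

```python
import enum
import time


class SnapshotState(enum.Enum):
    LOCAL_SNAPSHOT_STARTED = 1
    LOCAL_SNAPSHOT_COMPLETE = 2
    UPLOAD_BEGIN = 3
    UPLOAD_IN_PROGRESS = 4  # assumed stand-in for UPLOAD_%s_COMPLETE
    UPLOAD_COMPLETE = 5
    IMAGE_ACTIVE = 6


class SnapshotTracker:
    """Hypothetical progress record; not an existing nova or glance class."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.state = None            # no step has run yet
        self.percent = 0             # upload percentage last reported
        self.last_change = clock()   # the 'last state change' timestamp

    def advance(self, new_state, percent=None):
        """Move along the happy path, rejecting out-of-order transitions."""
        uploading = new_state is SnapshotState.UPLOAD_IN_PROGRESS
        if uploading and (percent is None
                          or not self.percent <= percent <= 100):
            raise ValueError("upload progress needs a non-decreasing percent")
        if self.state is None:
            allowed = new_state is SnapshotState.LOCAL_SNAPSHOT_STARTED
        else:
            # Progress updates may repeat; everything else moves one step on.
            repeat = uploading and self.state is new_state
            allowed = repeat or new_state.value == self.state.value + 1
        if not allowed:
            raise ValueError(
                "illegal transition: %s -> %s" % (self.state, new_state))
        self.state = new_state
        if uploading:
            self.percent = percent
        elif new_state is SnapshotState.UPLOAD_COMPLETE:
            self.percent = 100
        self.last_change = self._clock()

    def label(self):
        """State name as a client might see it."""
        if self.state is SnapshotState.UPLOAD_IN_PROGRESS:
            return "UPLOAD_%s_COMPLETE" % self.percent
        return self.state.name if self.state else "NOT_STARTED"

    def is_stalled(self, timeout):
        """True if an unfinished snapshot has not changed for `timeout` s."""
        finished = self.state is SnapshotState.IMAGE_ACTIVE
        return not finished and self._clock() - self.last_change > timeout

    def remaining(self):
        """States still to be reached, e.g. by whoever resumes the work."""
        done = self.state.value if self.state else 0
        return [s for s in SnapshotState if s.value > done]
```

A caller holding such a record could poll is_stalled() with a timeout of its choosing and, on a stall, use remaining() to decide what is left to do. Actual resumption and recovery is what step 3 expects taskflow to provide; this sketch does not attempt it.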

Specification URL (additional information):

None

