Bug 1041160

Summary: [RFE][nova]: Snapshot state consistency between glance and nova
Product: Red Hat OpenStack
Reporter: RHOS Integration <rhos-integ>
Component: RFEs
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED UPSTREAM
QA Contact: Ami Jeain <ajeain>
Severity: unspecified
Docs Contact:
Priority: low
Version: unspecified
CC: markmc, ndipanov, sgordon, yeylon
Target Milestone: ---
Keywords: FutureFeature, Triaged, Upstream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
URL: https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks
Whiteboard: upstream_milestone_none upstream_status_unknown upstream_definition_drafting
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-03-19 17:07:07 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description RHOS Integration 2013-12-12 13:49:12 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks.

Description:

To make forward progress on the summit discussion at https://etherpad.openstack.org/p/icehouse-summit-image-state-consistency, we propose to begin moving the snapshot code that currently lives in the nova-compute manager to the conductor, where it can be executed on behalf of the nova-compute that hosts the VM being snapshotted (and uploaded to glance). This allows the snapshot to be reliably recovered from (or resumed), so that the snapshotted VM ends up in an agreed-upon, well-defined state (not ERROR or IMAGE-UPLOADING). It will help avoid the state inconsistency that occurs when an upload only partially completes because of a service outage or network partition, and so make the interaction between glance and nova reliable.

This will likely involve the following steps:

0. Document and understand the current workflow and its deficiencies.

1. Move the orchestration of the snapshot workflow to the conductor (reducing what's in nova-compute to a smaller set). Handle, in the conductor, both the new and the existing error states that result from this modified workflow.

2. Once the basics of orchestrating the workflow in the conductor are working, support detection of stalled or errored-out snapshot uploads into glance by having the nova<->glance interaction go through a better-defined workflow state machine. For the snapshot happy path, this will likely involve going through the states [LOCAL_SNAPSHOT_STARTED, LOCAL_SNAPSHOT_COMPLETE, UPLOAD_BEGIN, UPLOAD_%s_COMPLETE, UPLOAD_COMPLETE, IMAGE_ACTIVE].

2a. For the error path, there will need to be a mechanism, queryable via glance or nova, that tells the user which stage of the snapshotting process nova is in. If the conductor processing the workflow has stalled, it would be nice for glance to know this via some type of 'last state change' timestamp (this can also be used to let the user know when the last state change occurred). If the conductor has not stalled, then 'UPLOAD_%s_COMPLETE' (this may be a new state, a status of an existing state, or something else entirely), which should carry the percentage of the upload completed so far, will be useful for showing clients that the upload is not yet complete.

2aa. In general, this whole 'liveness' detection would be better handled by some type of 'shared' liveness storage system (for example, a shared, agreed-upon path in zookeeper that can be used to know if nova has died during upload from glance), but a percent complete (and an associated last state change timestamp?), as well as the request connection/socket itself, is a good start.

3. Support running the above snapshot workflow in the conductor via taskflow, which brings in resumption, recovery, state tracking, and various other benefits.
  - https://blueprints.launchpad.net/nova/+spec/glance-snapshot-tasks-taskflow
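
To make steps 2 and 2a above more concrete, here is a minimal sketch of what a happy-path state machine with a 'last state change' timestamp and a percent-complete value might look like. It is plain Python and does not use any nova, glance, or taskflow API. The names SnapshotState, SnapshotRecord, advance, is_stalled, and the ALLOWED table are hypothetical, and UPLOAD_IN_PROGRESS is only an assumed stand-in for the 'UPLOAD_%s_COMPLETE' state named in the blueprint.

```python
import enum
import time
from dataclasses import dataclass
from typing import Optional


class SnapshotState(enum.Enum):
    """Happy-path states from step 2 (names taken from the blueprint)."""
    LOCAL_SNAPSHOT_STARTED = 1
    LOCAL_SNAPSHOT_COMPLETE = 2
    UPLOAD_BEGIN = 3
    UPLOAD_IN_PROGRESS = 4  # assumption: stands in for UPLOAD_%s_COMPLETE
    UPLOAD_COMPLETE = 5
    IMAGE_ACTIVE = 6


S = SnapshotState

# Hypothetical transition table: which states may follow which.
# UPLOAD_IN_PROGRESS may repeat as the reported percentage grows.
ALLOWED = {
    None: {S.LOCAL_SNAPSHOT_STARTED},
    S.LOCAL_SNAPSHOT_STARTED: {S.LOCAL_SNAPSHOT_COMPLETE},
    S.LOCAL_SNAPSHOT_COMPLETE: {S.UPLOAD_BEGIN},
    S.UPLOAD_BEGIN: {S.UPLOAD_IN_PROGRESS, S.UPLOAD_COMPLETE},
    S.UPLOAD_IN_PROGRESS: {S.UPLOAD_IN_PROGRESS, S.UPLOAD_COMPLETE},
    S.UPLOAD_COMPLETE: {S.IMAGE_ACTIVE},
    S.IMAGE_ACTIVE: set(),
}


@dataclass
class SnapshotRecord:
    """What a client could query: state, percent, last state change."""
    state: Optional[SnapshotState] = None
    percent_complete: int = 0
    last_state_change: float = 0.0

    def advance(self, new_state, percent=None, now=None):
        """Move to new_state, rejecting transitions not in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        if new_state is S.UPLOAD_COMPLETE:
            percent = 100
        if percent is not None:
            if not self.percent_complete <= percent <= 100:
                raise ValueError(f"bad percent_complete: {percent}")
            self.percent_complete = percent
        self.state = new_state
        self.last_state_change = time.time() if now is None else now

    def is_stalled(self, timeout, now=None):
        """True if an unfinished snapshot has not changed state recently."""
        if self.state is None or self.state is S.IMAGE_ACTIVE:
            return False
        now = time.time() if now is None else now
        return now - self.last_state_change > timeout
```

A caller would invoke something like rec.advance(S.UPLOAD_IN_PROGRESS, percent=40) as the upload proceeds, and a monitor could poll rec.is_stalled(timeout=60) to flag a conductor that has stopped making progress. The timeout value, and where such a record would be stored, are left open here, as they are in the blueprint.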

Specification URL (additional information):

None