Bug 1041080

Summary: [RFE][nova]: Baremetal nodes can be migrated among compute hosts
Product: Red Hat OpenStack Reporter: RHOS Integration <rhos-integ>
Component: RFEs Assignee: RHOS Maint <rhos-maint>
Status: CLOSED UPSTREAM QA Contact:
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: unspecified CC: markmc, yeylon
Target Milestone: --- Keywords: FutureFeature
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: https://blueprints.launchpad.net/nova/+spec/baremetal-compute-takeover
Whiteboard: upstream_milestone_none upstream_status_not-started upstream_definition_obsolete
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-03-19 16:49:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description RHOS Integration 2013-12-12 13:35:06 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/nova/+spec/baremetal-compute-takeover.

Description:

In a baremetal cloud with multiple nova-compute hosts, each nova-compute host is a single point of failure (SPoF) for the baremetal nodes it manages. There is currently no mechanism to move a node from one compute host to another, either manually or automatically; doing so requires deleting the node and adding it again, which invalidates any instance currently deployed to that node.

Note also that if a nova-compute host goes offline, Nova cannot control the baremetal nodes managed by that host, though any existing instances should continue to function as long as they do not restart.

Moving a node to another compute host could be accomplished by:
- adding a new baremetal (bm) node state, "migrating"
- adding a method to rebuild the TFTP environment for a deployed instance on a new compute host
- finding a means to update the nova scheduler so that the (host, hypervisor_hostname) pair can change. This would need to be possible regardless of whether an instance is active on that compute node.
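
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Nova code: the `BareMetalNode` class, the `migrate_node` function, and the `rebuild_tftp` callback are hypothetical names introduced here, and the real scheduler update is reduced to reassigning the `host` field.

```python
from dataclasses import dataclass
from typing import Callable

MIGRATING = "migrating"  # the proposed new bm state


@dataclass
class BareMetalNode:
    """Hypothetical in-memory stand-in for a baremetal node record."""
    hypervisor_hostname: str
    host: str   # nova-compute host that currently manages the node
    state: str


def migrate_node(
    node: BareMetalNode,
    new_host: str,
    rebuild_tftp: Callable[[BareMetalNode, str], None],
) -> BareMetalNode:
    """Move a node to new_host while it is marked as migrating.

    rebuild_tftp is a hypothetical callback that recreates the TFTP
    environment for the node on the new compute host.
    """
    if node.state == MIGRATING:
        raise RuntimeError("node is already migrating")
    previous_state = node.state
    node.state = MIGRATING
    try:
        rebuild_tftp(node, new_host)
        # Only change the (host, hypervisor_hostname) pairing once the
        # new host is ready to serve the node.
        node.host = new_host
    finally:
        # Restore the earlier state whether or not the move succeeded,
        # so an active instance stays active.
        node.state = previous_state
    return node
```

If `rebuild_tftp` raises, the node keeps its original host and state, which matches the requirement that a deployed instance must not be invalidated by a failed move.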

Additionally, if the nova_bm database tracked, for each node, the status of the compute host that owns it, other compute hosts could "take over" for a dead host. This would require the following changes:
- add a timestamp column to the bm_nodes table
- add a compute host periodic task that updates the timestamp
- add a compute host periodic task that looks for bm_nodes whose compute host has not checked in and initiates take-over, holding a distributed (i.e., db-managed) lock on that node, compute_host, and instance.
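
One way to picture the heartbeat and take-over tasks is the sketch below, which uses an in-memory SQLite database in place of nova_bm. Everything here is an assumption made for illustration: the reduced `bm_nodes` schema, the `service_host` and `heartbeat_at` column names, the `STALE_AFTER` threshold, and the function names. The db-managed lock is approximated with a conditional UPDATE (compare-and-swap on the owner and timestamp), which is a swapped-in technique and not something the blueprint specifies.

```python
import sqlite3

STALE_AFTER = 60  # seconds without a check-in before a host counts as dead (assumed)


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical, heavily reduced bm_nodes table with the proposed timestamp.
    conn.execute(
        "CREATE TABLE bm_nodes ("
        " id INTEGER PRIMARY KEY,"
        " service_host TEXT NOT NULL,"
        " heartbeat_at INTEGER NOT NULL)"
    )


def heartbeat(conn: sqlite3.Connection, host: str, now: int) -> None:
    """Periodic task: refresh the timestamp on every node this host owns."""
    conn.execute(
        "UPDATE bm_nodes SET heartbeat_at = ? WHERE service_host = ?",
        (now, host),
    )
    conn.commit()


def take_over_stale(conn: sqlite3.Connection, new_host: str, now: int) -> list:
    """Periodic task: claim nodes whose owner has not checked in recently."""
    stale = conn.execute(
        "SELECT id, service_host, heartbeat_at FROM bm_nodes"
        " WHERE service_host != ? AND heartbeat_at < ? ORDER BY id",
        (new_host, now - STALE_AFTER),
    ).fetchall()
    claimed = []
    for node_id, old_host, seen in stale:
        # The UPDATE only matches if owner and timestamp are unchanged since
        # the SELECT, so two hosts racing for the same node cannot both win.
        cur = conn.execute(
            "UPDATE bm_nodes SET service_host = ?, heartbeat_at = ?"
            " WHERE id = ? AND service_host = ? AND heartbeat_at = ?",
            (new_host, now, node_id, old_host, seen),
        )
        if cur.rowcount == 1:
            claimed.append(node_id)
    conn.commit()
    return claimed
```

A host that wins a node this way would still need to rebuild the node's deployment environment locally before it can manage the instance; the sketch covers only the ownership hand-off.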


This was discussed during the Havana summit here:
  https://etherpad.openstack.org/HavanaBaremetalNextSteps

Specification URL (additional information):

None