Bug 1569293

Summary:	Need to add deleted compute nodes back to the overcloud stack in the undercloud
Product:	[Community] RDO	Reporter:	David Manchado <dmanchad>
Component:	openstack-tripleo	Assignee:	James Slagle <jslagle>
Status:	CLOSED EOL	QA Contact:	Shai Revivo <srevivo>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	Ocata	CC:	kforde
Target Milestone:	---
Target Release:	trunk
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-06-15 20:08:35 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description David Manchado 2018-04-19 00:29:18 UTC

Description of problem:
Compute nodes were deleted from the overcloud stack at the undercloud level after a YAML file was commented out in the middle of the deploy script [1].
While troubleshooting, I wanted to skip one of the config files to rule it out as a potential root cause. Commenting it out ended up skipping not only that config file but also the roles data file.
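
One plausible mechanism for the behaviour described above, assuming the deploy script is a single command split across lines with trailing backslashes (the pastebin in [1] is not reproduced here, so this is a guess at its shape): a commented-out line ends the command at that point, and every option after it is dropped. The sketch below uses a hypothetical show_args function in place of the real deploy command, and the file names first.yaml, second.yaml and roles_data.yaml are made up for illustration.

```sh
# show_args is a hypothetical stand-in for the real deploy command;
# it only prints the arguments it actually receives.
show_args() { printf '%s\n' "$*"; }

deploy() {
  show_args --templates \
    -e first.yaml \
#   -e second.yaml \
    -r roles_data.yaml
}

# The comment swallows its own trailing backslash, so the command ends at
# first.yaml. The roles file line then runs as a separate (failing) command.
seen=$(deploy 2>/dev/null) || true
echo "arguments received: $seen"
```

The roles file option never reaches the command. Keeping the options in a bash array, one per line, avoids this, because comment lines are allowed between array elements.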

According to the undercloud, the servers/instances for the compute nodes are gone from the output of openstack server list, and no instances are shown by openstack baremetal node list either.

The compute nodes were shut down, but powering them up confirmed that the OS and data (instances included) were not purged. Everything seems to be OK at the overcloud level (all nodes are reported in nova and neutron): existing instances can be started and new ones can be spawned.

The deploy ended in UPDATE_FAILED status. I am not sure whether that is what saved us from an actual purge of the compute nodes.

I think we need to restore the undercloud database to a state where the mapping between nodes and instances is correct, either with SQL or with openstackclient / the CLI.

[1] http://pastebin.test.redhat.com/578960

Version-Release number of selected component (if applicable):
RDO Ocata

How reproducible:
Not tested

Steps to Reproduce:
1. Deploy OpenStack using a deploy script similar to [1], without commenting out any config file.
2. Re-run the deploy with a config file in the middle section commented out.

Actual results:
Overcloud compute nodes deleted from the overcloud stack.

Expected results:
Ideally, a check for the roles data file and an alert if it is not found, although that is not the purpose of this BZ.
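
A minimal sketch of such a check, assuming the deployment always relies on a custom roles file passed with -r / --roles-file and environment files passed with -e / --environment-file. check_deploy_args is a hypothetical helper, not part of TripleO. It would be called with the same arguments as the deploy command, before running it.

```sh
# check_deploy_args: hypothetical pre-flight helper, not part of TripleO.
# Fails if no roles file is given or if a referenced file does not exist.
check_deploy_args() {
  have_roles=no
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -r|--roles-file)
        have_roles=yes
        [ -f "$2" ] || { echo "missing roles file: $2" >&2; return 1; }
        shift ;;  # skip the file name
      -e|--environment-file)
        [ -f "$2" ] || { echo "missing environment file: $2" >&2; return 1; }
        shift ;;  # skip the file name
    esac
    shift
  done
  [ "$have_roles" = yes ] || { echo "no roles file given" >&2; return 1; }
}
```

A deploy script could then run check_deploy_args with its argument list and abort on a non-zero exit status, so that a truncated command is caught before any stack update starts.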

Additional info:

Comment 2 David Manchado 2018-04-19 08:54:49 UTC

We do have an undercloud snapshot taken before the last minor update (3 weeks ago).
I think restoring it and then running a minor update would be the safest recovery path.

Comment 3 David Manchado 2018-04-20 16:43:05 UTC

We have replicated the issue on staging, and reverting to the snapshot seems to do the trick.

We might do the same on the production environment too.

Comment 4 David Manchado 2018-04-25 11:01:12 UTC

After restoring from the snapshot everything seems to be ok (openstack baremetal node list & openstack server list).

Since the snapshot was taken right before the last minor update, we then ran a minor update on the undercloud, which succeeded.

We want to run a deploy to confirm everything is OK before moving on, but we have hit the following issues:

overcloud.Controller.1.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 4873582f-4633-42fa-bac5-d3cb6b3bb65d
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Started yum_update.sh on server 7de8d1ee-7cc9-4811-a3a1-5f878469feb4 at Thu Jan 25 10:05:03 UTC 2018
    Not running due to unset update_identifier
  deploy_stderr: |

overcloud.Controller.0.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 2bdfe429-726b-49af-a303-3870ad2c2848
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Started yum_update.sh on server 60e7413f-4ff9-45ff-a50c-645be4610d7f at Thu Jan 25 10:05:59 UTC 2018
    Not running due to unset update_identifier
  deploy_stderr: |
 
So just wondering:
* Should a deploy be expected to succeed in this situation?
* Should we go for an overcloud minor update?
* Should we run openstack overcloud deploy --update-plan-only and then the deploy?