Description of problem:
Compute nodes were deleted from the overcloud stack at the undercloud level after a yaml file was commented out in the middle of the deploy script [1]. While troubleshooting, I wanted to skip one of the config files to rule it out as a potential root cause. Commenting it out caused the deploy to skip not only that config file but also the roles data file.

According to the undercloud, the servers/instances related to the compute nodes are gone when issuing openstack server list, and no instances are shown when running openstack baremetal node list. The compute nodes were shut down, but powering them up confirmed that the OS and data (instances included) were not purged. Everything seems to be ok at the overcloud level: all nodes are reported in nova and neutron, existing instances can be started and new ones can be spawned. The deploy ended in UPDATE_FAILED status; I am not sure whether that helped us avoid an actual purge on the compute nodes.

I think the undercloud database needs to be brought back to a state where the mapping between nodes and instances is correct, by means of SQL or openstackclient / CLI.

[1] http://pastebin.test.redhat.com/578960

Version-Release number of selected component (if applicable):
ROD Ocata

How reproducible:
Not tested

Steps to Reproduce:
1. Deploy openstack using a deploy script similar to [1] without commenting out any config file.
2. Re-run the deploy with a config file in the middle section commented out.

Actual results:
Overcloud compute nodes are deleted from the overcloud stack.

Expected results:
Potentially a check for the roles data file and an alert if it is not found, but that is not the purpose of this BZ.

Additional info:
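The symptom described above (commenting out one file and losing the roles data file as well) is consistent with how bash handles a comment inside a backslash-continued command: the comment runs to the end of its line, the trailing backslash is not honored, and the command ends there. The sketch below is a minimal pre-flight check that flags such lines. It assumes the deploy script is a single command split across lines with backslashes, which cannot be confirmed here because the pastebin content is not reproduced in this report. The function name and the sample file names are hypothetical, and inline comments after an argument are not handled.

```python
def find_truncating_comments(script_text):
    """Return 1-based numbers of comment lines that follow a line continuation.

    Such a comment ends the continued command, so every argument on the
    following lines is silently dropped from it.
    """
    hits = []
    continued = False  # True if the previous line ended with a backslash
    for number, line in enumerate(script_text.splitlines(), start=1):
        if continued and line.strip().startswith("#"):
            hits.append(number)
            continued = False  # the comment terminates the command
            continue
        continued = line.endswith("\\")
    return hits


if __name__ == "__main__":
    # Hypothetical deploy script; file names are illustrative only.
    sample = (
        "openstack overcloud deploy --templates \\\n"
        "  -e first.yaml \\\n"
        "  # -e second.yaml \\\n"
        "  -r roles_data.yaml \\\n"
        "  -e third.yaml\n"
    )
    for number in find_truncating_comments(sample):
        print(f"line {number}: comment inside a continued command")
```

Running a check like this before each deploy would report the commented-out line, so the line can be removed entirely instead of commented out.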
We do have an undercloud snapshot taken before the last minor update (3 weeks ago). I think restoring it would be the safest recovery path, followed by a minor update.
We have replicated the issue on staging, and reverting to the snapshot seems to do the trick. We might do the same on the production environment.
After restoring from the snapshot everything seems to be ok (openstack baremetal node list & openstack server list). Since the snapshot was taken right before the last minor update, we have re-run the minor update on the undercloud, and it succeeded. We want to run a deploy to confirm everything is ok before moving on, but we have hit the following issues:

overcloud.Controller.1.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 4873582f-4633-42fa-bac5-d3cb6b3bb65d
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Started yum_update.sh on server 7de8d1ee-7cc9-4811-a3a1-5f878469feb4 at Thu Jan 25 10:05:03 UTC 2018
    Not running due to unset update_identifier
  deploy_stderr: |

overcloud.Controller.0.UpdateDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 2bdfe429-726b-49af-a303-3870ad2c2848
  status: UPDATE_FAILED
  status_reason: |
    UPDATE aborted
  deploy_stdout: |
    Started yum_update.sh on server 60e7413f-4ff9-45ff-a50c-645be4610d7f at Thu Jan 25 10:05:59 UTC 2018
    Not running due to unset update_identifier
  deploy_stderr: |

So just wondering:
* Should a deploy be expected to succeed in this situation?
* Should we go for an overcloud minor update?
* Should we go for openstack overcloud deploy --update-plan-only and then the deploy?