Description of problem:
While replacing a controller node, the deployment ends in one of three ways: Nova goes into an error state, the deployment fails because ctrl-0 cannot be de-registered, or the deployment times out.

Version-Release number of selected component (if applicable):
openstack-nova-api-16.0.2-3.el7ost.noarch
openstack-nova-common-16.0.2-3.el7ost.noarch
openstack-nova-compute-16.0.2-3.el7ost.noarch
openstack-nova-conductor-16.0.2-3.el7ost.noarch
openstack-nova-console-16.0.2-3.el7ost.noarch
openstack-nova-migration-16.0.2-3.el7ost.noarch
openstack-nova-novncproxy-16.0.2-3.el7ost.noarch
openstack-nova-placement-api-16.0.2-3.el7ost.noarch
openstack-nova-scheduler-16.0.2-3.el7ost.noarch
puppet-nova-11.4.0-2.el7ost.noarch
python-nova-16.0.2-3.el7ost.noarch
python-novaclient-9.1.1-1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
We are following the documented procedure:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes

Actual results:
The deployment fails: Nova gets into an error state, or ctrl-0 cannot be de-registered, or the deployment times out.

Expected results:
The controller node should be replaced successfully.

Additional info:
We are following a documented [1] procedure for replacing a crashed controller node. We simulated the crash in a lab (using dd if=/dev/zero of=/dev/vda bs=8M), but the documented procedure does NOT work. When the controller has crashed, the deployment ends in one of the following:
- Nova error state
- the controller cannot be de-registered
- timeout

Below is the behavior when the controller has crashed and is not reachable. In this case, the stack fails when trying to deregister the failed node. This eventually times out and the update fails.

(undercloud) [stack@director-2 templates]$ openstack stack list --nested | grep -v COMPLETE
+--------------------------------------+-------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+
| ID                                   | Stack Name                                                                    | Project                          | Stack Status       | Creation Time        | Updated Time         | Parent                               |
+--------------------------------------+-------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+
| 7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab | overcloud-Controller-jekus5wgenab-0-vtnxm5xjd3ef-NodeExtraConfig-utzgwodkzofq | d4f6510b19cf4a72a742aceabcd8009c | DELETE_IN_PROGRESS | 2018-11-22T09:41:40Z | None                 | f0bdb7b0-e27c-467f-9099-8f26d275f47f |
| f0bdb7b0-e27c-467f-9099-8f26d275f47f | overcloud-Controller-jekus5wgenab-0-vtnxm5xjd3ef                              | d4f6510b19cf4a72a742aceabcd8009c | DELETE_IN_PROGRESS | 2018-11-22T09:27:57Z | None                 | 0de8194e-145f-468c-8b7c-10451147be60 |
| 0de8194e-145f-468c-8b7c-10451147be60 | overcloud-Controller-jekus5wgenab                                             | d4f6510b19cf4a72a742aceabcd8009c | UPDATE_IN_PROGRESS | 2018-11-22T09:27:30Z | 2018-11-22T13:00:08Z | 463a3ab4-61a5-4b79-8f28-0f246a4cc673 |
| 463a3ab4-61a5-4b79-8f28-0f246a4cc673 | overcloud                                                                     | d4f6510b19cf4a72a742aceabcd8009c | UPDATE_IN_PROGRESS | 2018-11-22T09:23:26Z | 2018-11-22T12:54:45Z | None                                 |
+--------------------------------------+-------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+

(undercloud) [stack@director-2 templates]$ openstack stack resource list 7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+
| resource_name                | physical_resource_id                 | resource_type                | resource_status    | updated_time         |
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+
| RHELUnregistration           | e8a9cd03-78c7-44d4-9b96-1f69dd7108c7 | OS::Heat::SoftwareConfig     | CREATE_COMPLETE    | 2018-11-22T09:41:41Z |
| RHELUnregistrationDeployment | 4e9bd9e2-d81e-485d-9c14-7fb679fe5b29 | OS::Heat::SoftwareDeployment | DELETE_IN_PROGRESS | 2018-11-22T09:41:41Z |
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+

Even though the new controller (ctrl-3) is created, the failed ctrl-0 is never deleted:

(undercloud) [stack@director-2 templates]$ nova list
+--------------------------------------+--------+--------+------------+-------------+------------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks               |
+--------------------------------------+--------+--------+------------+-------------+------------------------+
| 36cd4688-aa10-4525-91e2-d9bbdf1fcf54 | c-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 569f16e8-a803-4b0b-bafc-0df77284e14f | c-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| af19d84d-c74a-4855-a18c-679813332aee | ceph-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 07c3a292-ff84-4d31-b766-84203eb5f5fa | ceph-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 3b852fee-6a6b-4b99-84c0-e495b7cdd3cc | ceph-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.16 |
| d54da3a8-9f4a-4766-843a-86b27cb46d3e | ctrl-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| abce4858-301e-4693-9326-f9e13aac8f04 | ctrl-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| cc502ca3-f621-45de-b5c7-2782be4915ab | ctrl-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| aa5c3ea2-ae40-44d7-9443-ad2a431ec1e5 | ctrl-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
+--------------------------------------+--------+--------+------------+-------------+------------------------+

For the time being, we considered manually deleting the failed controller using nova and then continuing with the procedure, but we are not sure how safe that is.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes
Hello team, Can we have an update here?
I reviewed the BZ information again. The step that fails is the unregistration from Satellite, not the delete from nova as the BZ description says.

{
  "status": "FAILED",
  "server_id": "e492fe24-7462-4107-be28-af9f41263fab",
  "config_id": "dbf136da-dbb2-454a-8db5-35d3c8a312f7",
  "output_values": null,
  "creation_time": "2018-10-31T15:07:09Z",
  "updated_time": "2018-11-02T14:03:18Z",
  "input_values": {
    "REG_METHOD": "satellite"
  },
  "action": "DELETE",
  "status_reason": "Deployment cancelled.",
  "id": "3a965793-bb1b-4a1b-9b52-b2f135edab11"
}

# openstack stack failures list --long overcloud
overcloud.Controller.0.NodeExtraConfig.RHELUnregistrationDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 3a965793-bb1b-4a1b-9b52-b2f135edab11
  status: DELETE_FAILED
  status_reason: |
    DELETE aborted
  deploy_stdout: |
    None
  deploy_stderr: |
    None

The unregistration is a step that runs on the node itself, so it now fails because the node is in a broken state. [1] describes a way to signal to the RHELUnregistrationDeployment resource that it has finished.

[1] https://access.redhat.com/solutions/2260561
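
For illustration only, here is a minimal Python sketch of how such a signal could be assembled. It only builds the command line for `openstack stack resource signal` and prints it; it does not run anything. The helper name `build_signal_command` is hypothetical, and the assumption that a payload with `deploy_status_code` set to 0 is what the linked solution sends has not been checked against that article. The stack ID in the usage line is the nested NodeExtraConfig stack from the description.

```python
import json
import shlex


def build_signal_command(stack_id, resource_name="RHELUnregistrationDeployment",
                         status_code=0):
    """Build (but do not run) the argv that signals a SoftwareDeployment.

    Hypothetical helper. The payload keys are the standard Heat
    software-deployment outputs; a status code of 0 is assumed to mark
    the deployment as finished successfully.
    """
    payload = {
        "deploy_stdout": "",
        "deploy_stderr": "",
        "deploy_status_code": status_code,
    }
    return [
        "openstack", "stack", "resource", "signal",
        "--data", json.dumps(payload, sort_keys=True),
        stack_id, resource_name,
    ]


if __name__ == "__main__":
    # Nested stack that owns the stuck resource, taken from the description.
    argv = build_signal_command("7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab")
    # Print a copy-pasteable command instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing rather than executing keeps the sketch safe to try on any machine; the resulting command would still need to be reviewed and run manually on the undercloud, and the overcloud update may need to be re-run afterwards.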