Bug 1652784 - Failed Controller node is not getting removed from nova and stack deployment fails.
Keywords:
Status: CLOSED DUPLICATE of bug 1313885
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Gurenko Alex
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-23 02:32 UTC by rohit londhe
Modified: 2022-03-13 17:07 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-28 12:18:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-13864 0 None None None 2022-03-13 17:07:44 UTC

Description rohit londhe 2018-11-23 02:32:02 UTC
Description of problem:

While replacing the controller node, the deployment either puts the node into a nova error state, or the deployment fails because ctrl-0 cannot be deregistered, or it times out.

Version-Release number of selected component (if applicable):

openstack-nova-api-16.0.2-3.el7ost.noarch                   
openstack-nova-common-16.0.2-3.el7ost.noarch                
openstack-nova-compute-16.0.2-3.el7ost.noarch                
openstack-nova-conductor-16.0.2-3.el7ost.noarch                
openstack-nova-console-16.0.2-3.el7ost.noarch                
openstack-nova-migration-16.0.2-3.el7ost.noarch             
openstack-nova-novncproxy-16.0.2-3.el7ost.noarch            
openstack-nova-placement-api-16.0.2-3.el7ost.noarch         
openstack-nova-scheduler-16.0.2-3.el7ost.noarch             
puppet-nova-11.4.0-2.el7ost.noarch                          
python-nova-16.0.2-3.el7ost.noarch                          
python-novaclient-9.1.1-1.el7ost.noarch    

How reproducible:
100%

Steps to Reproduce:

We are following the documented procedure:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes

Actual results:
The deployment fails with nova going into an error state, or the deployment fails because ctrl-0 cannot be deregistered, or it times out.

Expected results:

The controller node should be replaced successfully.

Additional info:

We are following the documented[1] procedure for replacing a crashed controller node.

We simulated the same failure in a lab (by destroying the node's disk with dd if=/dev/zero of=/dev/vda bs=8M), but the documented procedure does NOT work.

When the controller has crashed, the deployment either:

- goes into a nova error state,
- fails because the controller cannot be deregistered, or
- times out.

The behavior below occurs when the controller has crashed and is not reachable:

In this case, the stack fails when trying to deregister the failed node. This eventually times out and the update fails.

(undercloud) [stack@director-2 templates]$ openstack stack list --nested | grep -v COMPLETE
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+
| ID                                   | Stack Name                                                                                                                                                               | Project                          | Stack Status       | Creation Time        | Updated Time         | Parent                               |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+
| 7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab | overcloud-Controller-jekus5wgenab-0-vtnxm5xjd3ef-NodeExtraConfig-utzgwodkzofq                                                                                            | d4f6510b19cf4a72a742aceabcd8009c | DELETE_IN_PROGRESS | 2018-11-22T09:41:40Z | None                 | f0bdb7b0-e27c-467f-9099-8f26d275f47f |
| f0bdb7b0-e27c-467f-9099-8f26d275f47f | overcloud-Controller-jekus5wgenab-0-vtnxm5xjd3ef                                                                                                                         | d4f6510b19cf4a72a742aceabcd8009c | DELETE_IN_PROGRESS | 2018-11-22T09:27:57Z | None                 | 0de8194e-145f-468c-8b7c-10451147be60 |
| 0de8194e-145f-468c-8b7c-10451147be60 | overcloud-Controller-jekus5wgenab                                                                                                                                        | d4f6510b19cf4a72a742aceabcd8009c | UPDATE_IN_PROGRESS | 2018-11-22T09:27:30Z | 2018-11-22T13:00:08Z | 463a3ab4-61a5-4b79-8f28-0f246a4cc673 |
| 463a3ab4-61a5-4b79-8f28-0f246a4cc673 | overcloud                                                                                                                                                                | d4f6510b19cf4a72a742aceabcd8009c | UPDATE_IN_PROGRESS | 2018-11-22T09:23:26Z | 2018-11-22T12:54:45Z | None                                 |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+--------------------+----------------------+----------------------+--------------------------------------+
(undercloud) [stack@director-2 templates]$ openstack stack resource list 7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+
| resource_name                | physical_resource_id                 | resource_type                | resource_status    | updated_time         |
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+
| RHELUnregistration           | e8a9cd03-78c7-44d4-9b96-1f69dd7108c7 | OS::Heat::SoftwareConfig     | CREATE_COMPLETE    | 2018-11-22T09:41:41Z |
| RHELUnregistrationDeployment | 4e9bd9e2-d81e-485d-9c14-7fb679fe5b29 | OS::Heat::SoftwareDeployment | DELETE_IN_PROGRESS | 2018-11-22T09:41:41Z |
+------------------------------+--------------------------------------+------------------------------+--------------------+----------------------+ 
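To confirm which step is hanging, the stuck resource and its underlying software deployment can be inspected with the commands below (a sketch using the IDs from the `openstack stack resource list` output above; `openstack software deployment show` requires python-heatclient's OSC plugin):

```shell
# Show the resource stuck in DELETE_IN_PROGRESS (nested stack ID and
# resource name are taken from the listing above).
openstack stack resource show \
    7bfe7da7-8af4-4e4f-b866-e8b1f01f5aab RHELUnregistrationDeployment

# Show the underlying software deployment, including its action,
# status, and input values.
openstack software deployment show 4e9bd9e2-d81e-485d-9c14-7fb679fe5b29
```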


Even though the new controller (ctrl-3) is created, the failed ctrl-0 will never be deleted: 

(undercloud) [stack@director-2 templates]$ nova list
+--------------------------------------+--------+--------+------------+-------------+------------------------+
| ID                                   | Name   | Status | Task State | Power State | Networks               |
+--------------------------------------+--------+--------+------------+-------------+------------------------+
| 36cd4688-aa10-4525-91e2-d9bbdf1fcf54 | c-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 569f16e8-a803-4b0b-bafc-0df77284e14f | c-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| af19d84d-c74a-4855-a18c-679813332aee | ceph-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| 07c3a292-ff84-4d31-b766-84203eb5f5fa | ceph-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 3b852fee-6a6b-4b99-84c0-e495b7cdd3cc | ceph-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.16 |
| d54da3a8-9f4a-4766-843a-86b27cb46d3e | ctrl-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| abce4858-301e-4693-9326-f9e13aac8f04 | ctrl-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| cc502ca3-f621-45de-b5c7-2782be4915ab | ctrl-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| aa5c3ea2-ae40-44d7-9443-ad2a431ec1e5 | ctrl-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
+--------------------------------------+--------+--------+------------+-------------+------------------------+ 

For the time being, we considered manually deleting the controller using nova and continuing with the procedure, but we are not sure how safe that is.
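For reference, the manual workaround being considered would look roughly like the following sketch (server ID is ctrl-0's from the nova list output above). Note that deleting the server directly bypasses Heat, which still tracks the node, so the stack can be left out of sync; this is one reason we are asking whether it is safe:

```shell
# Delete the failed controller's server directly in nova, bypassing Heat.
# WARNING: Heat still references this node in the overcloud stack.
openstack server delete d54da3a8-9f4a-4766-843a-86b27cb46d3e

# The corresponding ironic node may still be marked as provisioned and
# may need separate cleanup; check with:
openstack baremetal node list
```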

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/director_installation_and_usage/sect-scaling_the_overcloud#sect-Replacing_Controller_Nodes

Comment 5 rohit londhe 2018-11-27 09:59:02 UTC
Hello team,

Can we have an update here?

Comment 7 Martin Schuppert 2018-11-27 10:40:13 UTC
Reviewing the BZ information again, the step that is failing is the
unregister from Satellite, not the delete from nova as the BZ description
says.

{
  "status": "FAILED", 
  "server_id": "e492fe24-7462-4107-be28-af9f41263fab", 
  "config_id": "dbf136da-dbb2-454a-8db5-35d3c8a312f7", 
  "output_values": null, 
  "creation_time": "2018-10-31T15:07:09Z", 
  "updated_time": "2018-11-02T14:03:18Z", 
  "input_values": {
    "REG_METHOD": "satellite"
  }, 
  "action": "DELETE", 
  "status_reason": "Deployment cancelled.", 
  "id": "3a965793-bb1b-4a1b-9b52-b2f135edab11"
}
# openstack stack failures list --long overcloud
overcloud.Controller.0.NodeExtraConfig.RHELUnregistrationDeployment:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 3a965793-bb1b-4a1b-9b52-b2f135edab11
  status: DELETE_FAILED
  status_reason: |
    DELETE aborted
  deploy_stdout: |
None
  deploy_stderr: |
None

The unregister is a step that runs on the node itself, which now fails because the
node is in a broken state. [1] describes a way to signal to the
RHELUnregistrationDeployment resource that it has finished.


[1] https://access.redhat.com/solutions/2260561
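A rough sketch of that workaround, assuming the stack and resource names from the failure output above; the exact payload Heat expects may differ, so the linked solution should be treated as the authoritative procedure:

```shell
# Manually signal the stuck SoftwareDeployment as successfully finished
# so Heat stops waiting on the unreachable node. A zero
# deploy_status_code marks the deployment as succeeded.
openstack stack resource signal \
    overcloud-Controller-jekus5wgenab-0-vtnxm5xjd3ef-NodeExtraConfig-utzgwodkzofq \
    RHELUnregistrationDeployment \
    --data '{"deploy_stdout": "", "deploy_stderr": "", "deploy_status_code": 0}'
```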

