Bug 1575572

Summary: Can't remove an overcloud node with wrong network configuration (not accessible)
Product: Red Hat OpenStack
Reporter: Eduard Barrera <ebarrera>
Component: openstack-tripleo
Assignee: James Slagle <jslagle>
Status: CLOSED NOTABUG
QA Contact: Arik Chernetsky <achernet>
Severity: unspecified
Priority: unspecified
Version: 11.0 (Ocata)
CC: aschultz, bfournie, dtantsur, ebarrera, hjensas, mburns
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-06-06 11:56:29 UTC
Type: Bug

Description Eduard Barrera 2018-05-07 10:24:27 UTC
Description of problem:

After deploying a node with a wrong network configuration (the node is not accessible over the network), it is not possible to continue operating the overcloud, and the node cannot be removed.

Both [1] and [2] below have been tried:


[1] openstack overcloud node delete ..
[2] openstack server delete ...
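
A rough sketch of what those two attempts look like in full (the node UUID here is a placeholder, not taken from this report):

# [1] scale the node out of the overcloud: removes it from the Heat stack
#     and deletes the Nova instance
$ openstack overcloud node delete --stack overcloud <node-uuid>

# [2] delete the Nova instance directly, bypassing Heat
$ openstack server delete <node-uuid>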

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
always


Steps to Reproduce:

# The node is shut down to simulate a node that is not accessible

[root@ds-gpu-compute-001 tmp]# date
Fri May  4 15:26:47 CEST 2018



[root@ds-gpu-compute-001 tmp]# halt <===



Connection to 172.1x.74 closed by remote host.
Connection to 172.1x.74 closed.
[stack@pollux-tds-undercloud ~]$ 

# I try deleting the node

[stack@pollux-tds-undercloud ~]$ openstack server list
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
| ID                                   | Name                         | Status | Networks              | Image Name     |
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
| 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd | ds-gpu-compute-001   | ACTIVE | ctlplane=172.1x.74 | overcloud-full |
| 08d5fe68-a6a9-4e2f-aaa3-52bbf38a428d | ds-compute-002       | ACTIVE | ctlplane=172.1x.81 | overcloud-full |
| 31728d9e-99e0-4197-93f2-fc3e5b4d0f56 | ds-leone-compute-001 | ACTIVE | ctlplane=172.1x.66 | overcloud-full |
| 7b969d71-5188-4947-9009-cc30c10a92b4 | ds-leone-compute-002 | ACTIVE | ctlplane=172.1x.70 | overcloud-full |
| d3687305-bb29-4ed5-a374-1dbf47fb5275 | ds-cephstorage-003   | ACTIVE | ctlplane=172.1x.65 | overcloud-full |
| 7ef1310d-d1ed-4c14-91c4-66aba69e0203 | ds-controller-1      | ACTIVE | ctlplane=172.1x.76 | overcloud-full |
| 114f6bee-9992-4176-ad15-b9b61ef30432 | ds-controller-3      | ACTIVE | ctlplane=172.1x.62 | overcloud-full |
| 034f9420-1e60-4677-9a56-07c16ee5356d | ds-cephstorage-001   | ACTIVE | ctlplane=172.1x.71 | overcloud-full |
| 183b274d-4f81-4c2f-8bdc-8156b59e2a2a | ds-controller-2      | ACTIVE | ctlplane=172.1x.68 | overcloud-full |
| d752647f-98cc-486d-b276-b7eaadea101b | ds-compute-001       | ACTIVE | ctlplane=172.1x.72 | overcloud-full |
| 1591e18e-d5de-45aa-b45f-7d75d1cd440b | ds-cephstorage-002   | ACTIVE | ctlplane=172.1x.73 | overcloud-full |
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
[stack@pollux-tds-undercloud ~]$ openstack stack list
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| e7db2d58-6e56-42d7-bfa4-30b9c4306f5a | overcloud  | UPDATE_COMPLETE | 2018-05-03T08:57:50Z | 2018-05-03T15:07:43Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+


$ openstack overcloud node delete --stack overcloud \
 --templates ~/templates/ \
 -e ~/templates/environments/network-isolation.yaml \
 -e ~/templates/environments/network-management.yaml \
 -e ~/templates/environments/puppet-pacemaker.yaml \
 -e ~/templates/environments/neutron-ovs-dvr.yaml \
 -e ~/templates/environments/pollux-tds-storage-environment.yaml \
 -e ~/templates/environments/cinder-backup.yaml \
 -e ~/templates/environments/services/mistral.yaml \
 -e ~/templates/environments/services/sahara.yaml \
 -e ~/templates/environments/tls-endpoints-public-dns.yaml \
 -e ~/templates/environments/updates/update-from-keystone-admin-internal-api.yaml \
 -e ~/templates/environments/sshd-banner.yaml \
 -e ~/templates/environments/enable-tls.yaml \
 -e ~/templates/environments/network-environment.yaml \
 -e ~/templates/environments/environment.yaml \
 -e ~/templates/environments/hostname-mapping.yaml \
 -e ~/templates/environments/swift-external.yaml \
 -e ~/templates/environments/fencing.yaml 

[stack@pollux-tds-undercloud deploy-scripts]$ bash delete-gpu.sh
Deleting the following nodes from stack overcloud:
- 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd
Started Mistral Workflow tripleo.scale.v1.delete_node. Execution ID: 9d1bd304-f148-4fa6-96e7-db9689b788cf
Waiting for messages on queue '2dd73d94-f017-4910-820b-e3237ed0423d' with no timeout.

### After many hours the operation hits its timeout and the stack update fails...

[stack@pollux-tds-undercloud ~]$ openstack stack list
+--------------------------------------+------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| e7db2d58-6e56-42d7-bfa4-30b9c4306f5a | overcloud  | UPDATE_FAILED | 2018-05-03T08:57:50Z | 2018-05-04T13:33:33Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+
[stack@pollux-tds-undercloud ~]$ date
Mon May  7 09:14:15 CEST 2018
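
# To narrow down where the scale-down got stuck, the failed Heat resources and
# the Mistral execution can be inspected, roughly like this (a sketch; the
# execution ID is the one printed by the delete command above):

$ openstack stack resource list -n 5 overcloud | grep -i failed
$ openstack stack event list overcloud | tail -20
$ openstack workflow execution show 9d1bd304-f148-4fa6-96e7-db9689b788cf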





Actual results:
The stack is left in UPDATE_FAILED and can no longer be updated.

Expected results:
The node can be deleted.

What should we do in this case?

Comment 2 Dmitry Tantsur 2018-05-07 11:04:47 UTC
Hi! It's hard to tell what went wrong from a quick glance, but http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#how-do-i-repair-broken-nodes may be the answer.

Comment 4 Dmitry Tantsur 2018-05-08 13:24:35 UTC
I don't see signs of problems on the nova/ironic side at first glance. First, what failure exactly does Heat show? Second, you said you tried 'openstack server delete' - what was the result? What was the final state of the nova instance and the ironic node?
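
Something like the following would show those states (the server UUID is the one from the listing in the description; '<ironic-node>' is a placeholder):

$ openstack server show 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd -c status -c OS-EXT-STS:vm_state -c OS-EXT-STS:task_state
$ openstack baremetal node show <ironic-node> -c provision_state -c power_state -c maintenance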

Comment 7 Dmitry Tantsur 2018-05-15 12:12:28 UTC
Hi,

So, it's a bit unclear whether the node was deleted and properly cleaned up after it became problematic.

Just to be clear on terminology:
1. Ironic node delete ('openstack baremetal node delete') removes the node from the inventory completely.
2. Instance delete ('openstack server delete') removes the Nova instance and unprovisions the node. It does not do #1.
3. Overcloud node delete ('openstack overcloud node delete') removes the node from the Heat stack and does #2, but NOT #1!
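
In CLI terms that is roughly the following (the IDs are placeholders, not values from this bug):

# 1. remove the node from the Ironic inventory entirely
$ openstack baremetal node delete <ironic-node>
# 2. delete the Nova instance and unprovision the bare metal node (the node stays in Ironic)
$ openstack server delete <server-id>
# 3. remove the node from the Heat stack; this also does #2, but NOT #1
$ openstack overcloud node delete --stack overcloud <server-id>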

What I hear from you is that #3 fails with a 'not found' error, so the node is no longer in the stack. We just have to unprovision it.

So, the plan is:

1. Try undeploying the instance with 'openstack server delete'.
2. If it fails and you don't understand why, follow http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#how-do-i-repair-broken-nodes.
3. In both cases wipe the node's hard drive and power it off before doing anything else!
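
For step 3, if the node still answers on its BMC, the power-off can be driven from the undercloud through Ironic rather than from the node's OS, roughly like this ('<ironic-node>' is a placeholder):

$ openstack baremetal node power off <ironic-node>
# if the Nova instance is already gone but Ironic still shows the node as active:
$ openstack baremetal node undeploy <ironic-node>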

Now, as to the warning on that page. Force-deleting a node may prevent Ironic and Nova from cleaning up resources associated with it. Specifically, the node will stay powered on and its ports won't be disconnected. I left that warning mostly because people used to take this procedure too lightly, doing it every time they had a deployment failure.

Hope that helps,
Dmitry

Comment 9 Bob Fournier 2018-05-16 22:54:14 UTC
Eduard - based on comments 7 and 8, do you have all the info you need to close this?

Comment 10 Bob Fournier 2018-06-06 11:56:29 UTC
Eduard - I think the info has been provided; please reopen this if that is not the case. Thanks.

Comment 11 Red Hat Bugzilla 2023-09-15 01:27:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days