Bug 1575572 - Can't remove an overcloud node with wrong network configuration (not accessible)
Summary: Can't remove an overcloud node with wrong network configuration (not accessible)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo
Version: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: James Slagle
QA Contact: Arik Chernetsky
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-05-07 10:24 UTC by Eduard Barrera
Modified: 2023-09-15 01:27 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-06-06 11:56:29 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Issue Tracker OSP-9015 (last updated 2022-08-09 10:48:09 UTC)

Description Eduard Barrera 2018-05-07 10:24:27 UTC
Description of problem:

After deploying a node with a wrong network configuration (the node is not accessible over the network), it is no longer possible to operate the OpenStack deployment, and the node cannot be removed.

Removal has been tried with both [1] and [2]; example invocations are sketched below the references.


[1] openstack overcloud node delete ..
[2] openstack server delete ...
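
A hypothetical example of how [1] and [2] map onto the affected node (the UUID and name are taken from the server list in the reproduction steps; the exact arguments used are an assumption):

[1] openstack overcloud node delete --stack overcloud 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd
[2] openstack server delete ds-gpu-compute-001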

Version-Release number of selected component (if applicable):
OSP 11

How reproducible:
always


Steps to Reproduce:

# The node is shut down to simulate it being inaccessible

[root@ds-gpu-compute-001 tmp]# date
Fri May  4 15:26:47 CEST 2018



[root@ds-gpu-compute-001 tmp]# halt <===



Connection to 172.1x.74 closed by remote host.
Connection to 172.1x.74 closed.
[stack@pollux-tds-undercloud ~]$ 

# I try deleting the node

[stack@pollux-tds-undercloud ~]$ openstack server list
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
| ID                                   | Name                         | Status | Networks              | Image Name     |
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
| 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd | ds-gpu-compute-001   | ACTIVE | ctlplane=172.1x.74 | overcloud-full |
| 08d5fe68-a6a9-4e2f-aaa3-52bbf38a428d | ds-compute-002       | ACTIVE | ctlplane=172.1x.81 | overcloud-full |
| 31728d9e-99e0-4197-93f2-fc3e5b4d0f56 | ds-leone-compute-001 | ACTIVE | ctlplane=172.1x.66 | overcloud-full |
| 7b969d71-5188-4947-9009-cc30c10a92b4 | ds-leone-compute-002 | ACTIVE | ctlplane=172.1x.70 | overcloud-full |
| d3687305-bb29-4ed5-a374-1dbf47fb5275 | ds-cephstorage-003   | ACTIVE | ctlplane=172.1x.65 | overcloud-full |
| 7ef1310d-d1ed-4c14-91c4-66aba69e0203 | ds-controller-1      | ACTIVE | ctlplane=172.1x.76 | overcloud-full |
| 114f6bee-9992-4176-ad15-b9b61ef30432 | ds-controller-3      | ACTIVE | ctlplane=172.1x.62 | overcloud-full |
| 034f9420-1e60-4677-9a56-07c16ee5356d | ds-cephstorage-001   | ACTIVE | ctlplane=172.1x.71 | overcloud-full |
| 183b274d-4f81-4c2f-8bdc-8156b59e2a2a | ds-controller-2      | ACTIVE | ctlplane=172.1x.68 | overcloud-full |
| d752647f-98cc-486d-b276-b7eaadea101b | ds-compute-001       | ACTIVE | ctlplane=172.1x.72 | overcloud-full |
| 1591e18e-d5de-45aa-b45f-7d75d1cd440b | ds-cephstorage-002   | ACTIVE | ctlplane=172.1x.73 | overcloud-full |
+--------------------------------------+------------------------------+--------+-----------------------+----------------+
[stack@pollux-tds-undercloud ~]$ openstack stack list
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| e7db2d58-6e56-42d7-bfa4-30b9c4306f5a | overcloud  | UPDATE_COMPLETE | 2018-05-03T08:57:50Z | 2018-05-03T15:07:43Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+


$ openstack overcloud node delete --stack overcloud \
 --templates ~/templates/ \
 -e ~/templates/environments/network-isolation.yaml \
 -e ~/templates/environments/network-management.yaml \
 -e ~/templates/environments/puppet-pacemaker.yaml \
 -e ~/templates/environments/neutron-ovs-dvr.yaml \
 -e ~/templates/environments/pollux-tds-storage-environment.yaml \
 -e ~/templates/environments/cinder-backup.yaml \
 -e ~/templates/environments/services/mistral.yaml \
 -e ~/templates/environments/services/sahara.yaml \
 -e ~/templates/environments/tls-endpoints-public-dns.yaml \
 -e ~/templates/environments/updates/update-from-keystone-admin-internal-api.yaml \
 -e ~/templates/environments/sshd-banner.yaml \
 -e ~/templates/environments/enable-tls.yaml \
 -e ~/templates/environments/network-environment.yaml \
 -e ~/templates/environments/environment.yaml \
 -e ~/templates/environments/hostname-mapping.yaml \
 -e ~/templates/environments/swift-external.yaml \
 -e ~/templates/environments/fencing.yaml 

[stack@pollux-tds-undercloud deploy-scripts]$ bash delete-gpu.sh
Deleting the following nodes from stack overcloud:
- 9efadcbd-51d6-4ae0-87ff-301dd8ff6acd
Started Mistral Workflow tripleo.scale.v1.delete_node. Execution ID: 9d1bd304-f148-4fa6-96e7-db9689b788cf
Waiting for messages on queue '2dd73d94-f017-4910-820b-e3237ed0423d' with no timeout.

### after many hours the operation hits a timeout and the stack update fails...

[stack@pollux-tds-undercloud ~]$ openstack stack list
+--------------------------------------+------------+---------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status  | Creation Time        | Updated Time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| e7db2d58-6e56-42d7-bfa4-30b9c4306f5a | overcloud  | UPDATE_FAILED | 2018-05-03T08:57:50Z | 2018-05-04T13:33:33Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+
[stack@pollux-tds-undercloud ~]$ date
Mon May  7 09:14:15 CEST 2018
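
A sketch of one way to see which Heat resource actually failed (assuming the heatclient shipped with OSP 11 provides the 'stack failures' subcommand):

$ openstack stack failures list overcloud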





Actual results:
The overcloud stack is left in UPDATE_FAILED state and can no longer be updated.

Expected results:
The node can be deleted.

What should we do in this case?

Comment 2 Dmitry Tantsur 2018-05-07 11:04:47 UTC
Hi! It's hard to tell what went wrong from a quick glance, but http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#how-do-i-repair-broken-nodes may be the answer.

Comment 4 Dmitry Tantsur 2018-05-08 13:24:35 UTC
I don't see signs of problems on the nova/ironic side at first glance. First, what failure exactly does Heat show? Second, you said you tried 'openstack server delete'; what was the result? What was the final state of the nova instance and the ironic node?

Comment 7 Dmitry Tantsur 2018-05-15 12:12:28 UTC
Hi,

So, it's a bit unclear whether the node was deleted and properly cleaned up after it became problematic.

Just to be clear on terminology (a rough command sketch follows this list):
1. Ironic node delete ('openstack baremetal node delete') removes it from the inventory completely.
2. Instance delete ('openstack server delete') removes the Nova instance and unprovisions the node. It does not do #1.
3. Overcloud node delete ('openstack overcloud node delete') removes the node from the Heat stack and does #2, but NOT #1!
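
As a rough sketch (the node/instance identifiers are placeholders, not taken from this environment):

# 1. Remove the bare metal node from the Ironic inventory completely:
$ openstack baremetal node delete <ironic-node-uuid>
# 2. Remove only the Nova instance; the Ironic node is unprovisioned but stays registered:
$ openstack server delete <nova-server-uuid>
# 3. Remove the node from the Heat stack; this also does #2, but not #1:
$ openstack overcloud node delete --stack overcloud <nova-server-uuid>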

What I hear from you is that #3 fails with a 'not found' error, so the node is no longer in the stack. We just have to unprovision it.

So, the plan is:

1. Try undeploying the instance with 'openstack server delete' (see the sketch after this list).
2. If it fails and you don't understand why, follow http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#how-do-i-repair-broken-nodes.
3. In both cases wipe the node's hard drive and power it off before doing anything else!
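
A minimal sketch of the checks around steps 1-3 (the node UUID is a placeholder; the linked troubleshooting page remains the authoritative procedure):

# Check the current provision and power state of the stuck node:
$ openstack baremetal node show <ironic-node-uuid>
# Put the node into maintenance mode so nothing else acts on it while you repair it:
$ openstack baremetal node maintenance set <ironic-node-uuid>
# Make sure it is powered off (after wiping its disk, per step 3):
$ openstack baremetal node power off <ironic-node-uuid>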

Now, as to the warning on that page. Force-deleting a node may prevent Ironic and Nova from cleaning up resources associated with it. Specifically, the node will stay powered on and its ports won't be disconnected. I left that warning mostly because people used to take this procedure too lightly, running it every time they had a deployment failure.

Hope that helps,
Dmitry

Comment 9 Bob Fournier 2018-05-16 22:54:14 UTC
Eduard - based on comments 7 and 8, do you have all the info you need to close this?

Comment 10 Bob Fournier 2018-06-06 11:56:29 UTC
Eduard - I think the info has been provided, please reopen this if not the case. Thanks.

Comment 11 Red Hat Bugzilla 2023-09-15 01:27:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days

