Red Hat Bugzilla – Bug 1399429
Unable to delete overcloud node when identifying --stack by UUID; using the stack name works
Last modified: 2017-02-01 09:46:17 EST
I. Description of problem:

I created an OsdCompute custom role [1] and was able to deploy [2] a physical 7-node overcloud with 3 Controllers and 4 OsdComputes. However, I was unable to delete one of my OsdComputes using `openstack overcloud node delete`; the command failed with an unrecognized-argument error for '-r':

[stack@hci-director ~]$ echo $stack_id
23e7c364-7303-4af6-b54d-cfbf1b737680
[stack@hci-director ~]$ echo $nova_id
5fa641cf-b290-4a2a-b15e-494ab9d10d8a
[stack@hci-director ~]$ time openstack overcloud node delete --stack $stack_id --templates \
> -r ~/custom-templates/custom-roles.yaml \
> -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
> -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
> -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
> -e ~/custom-templates/network.yaml \
> -e ~/custom-templates/ceph.yaml \
> -e ~/custom-templates/layout.yaml $nova_id
usage: openstack overcloud node delete [-h] [--stack STACK]
                                       [--templates [TEMPLATES]]
                                       [-e <HEAT ENVIRONMENT FILE>]
                                       <node> [<node> ...]
openstack overcloud node delete: error: unrecognized arguments: -r 5fa641cf-b290-4a2a-b15e-494ab9d10d8a

real    0m0.758s
user    0m0.501s
sys     0m0.085s
[stack@hci-director ~]$

[1] https://github.com/RHsyseng/hci/blob/master/custom-templates/custom-roles.yaml#L168
[2] https://github.com/RHsyseng/hci/blob/master/scripts/deploy.sh

II. Version-Release number of selected component (if applicable):

Reproduced using the puddle from 10.0-RHEL-7/2016-11-19.4.

[stack@hci-director ~]$ rpm -qa | egrep tripleo | sort
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-tripleo-common-5.4.0-2.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-3.el7ost.noarch
openstack-tripleo-image-elements-5.1.0-1.el7ost.noarch
openstack-tripleo-puppet-elements-5.1.0-2.el7ost.noarch
openstack-tripleo-ui-1.0.5-1.el7ost.noarch
openstack-tripleo-validations-5.1.0-5.el7ost.noarch
puppet-tripleo-5.4.0-2.el7ost.noarch
python-tripleoclient-5.4.0-1.el7ost.noarch
[stack@hci-director ~]$

III. How reproducible:

Deterministic

IV. Steps to Reproduce:
1. Deploy an overcloud which has nodes from a custom role as described in our docs*
2. Try to delete one of the nodes from the custom role while keeping the rest of the overcloud running

* https://access.redhat.com/documentation/en/red-hat-openstack-platform/10-beta/single/advanced-overcloud-customization/#example_3_creating_a_new_role

V. Actual results:

The custom-role node, an OsdCompute in this case, is not deleted and an error is seen instead.

VI. Expected results:

The custom-role node, an OsdCompute in this case, is deleted just as a node from a non-custom role, e.g. Compute, would be.

VII. Additional info:

All of my Heat templates can be seen at:
https://github.com/RHsyseng/hci/tree/master/custom-templates

Attempting the delete without the -r option produces the following error:

[stack@hci-director ~]$ time openstack overcloud node delete --stack $stack_id --templates -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e ~/custom-templates/network.yaml -e ~/custom-templates/ceph.yaml -e ~/custom-templates/layout.yaml $nova_id
deleting nodes [u'5fa641cf-b290-4a2a-b15e-494ab9d10d8a'] from stack 23e7c364-7303-4af6-b54d-cfbf1b737680
Started Mistral Workflow.
Execution ID: 0bf0e91c-2f49-4f84-a46d-f251ee99e9fe
{u'execution': {u'id': u'0bf0e91c-2f49-4f84-a46d-f251ee99e9fe',
                u'input': {u'container': u'23e7c364-7303-4af6-b54d-cfbf1b737680',
                           u'nodes': [u'5fa641cf-b290-4a2a-b15e-494ab9d10d8a'],
                           u'queue_name': u'668f46fb-1b76-46c8-898d-b260cc2c0996',
                           u'timeout': 240},
                u'name': u'tripleo.scale.v1.delete_node',
                u'params': {},
                u'spec': {u'description': u'deletes given overcloud nodes and updates the stack',
                          u'input': [u'container', u'nodes', {u'timeout': 240}, {u'queue_name': u'tripleo'}],
                          u'name': u'delete_node',
                          u'tasks': {u'delete_node': {u'action': u'tripleo.scale.delete_node nodes=<% $.nodes %> timeout=<% $.timeout %> container=<% $.container %>',
                                                      u'name': u'delete_node',
                                                      u'on-error': u'set_delete_node_failed',
                                                      u'on-success': u'send_message',
                                                      u'type': u'direct',
                                                      u'version': u'2.0'},
                                     u'send_message': {u'action': u'zaqar.queue_post',
                                                       u'input': {u'messages': {u'body': {u'payload': {u'execution': u'<% execution() %>',
                                                                                                       u'message': u"<% $.get('message', '') %>",
                                                                                                       u'status': u"<% $.get('status', 'SUCCESS') %>"},
                                                                                          u'type': u'tripleo.scale.v1.delete_node'}},
                                                                  u'queue_name': u'<% $.queue_name %>'},
                                                       u'name': u'send_message',
                                                       u'retry': u'count=5 delay=1',
                                                       u'type': u'direct',
                                                       u'version': u'2.0'},
                                     u'set_delete_node_failed': {u'name': u'set_delete_node_failed',
                                                                 u'on-success': u'send_message',
                                                                 u'publish': {u'message': u'<% task(delete_node).result %>',
                                                                              u'status': u'FAILED'},
                                                                 u'type': u'direct',
                                                                 u'version': u'2.0'}},
                          u'version': u'2.0'}},
 u'message': u"Failed to run action [action_ex_id=0d09a2ab-007d-4efa-893d-f7e26c2a8c91, action_cls='<class 'mistral.actions.action_factory.ScaleDownAction'>', attributes='{}', params='{u'nodes': [u'5fa641cf-b290-4a2a-b15e-494ab9d10d8a'], u'container': u'23e7c364-7303-4af6-b54d-cfbf1b737680', u'timeout': 240}']\n Environment not found [name=23e7c364-7303-4af6-b54d-cfbf1b737680]",
 u'status': u'FAILED'}

real    1m37.188s
user    0m0.515s
sys     0m0.074s
[stack@hci-director ~]$
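The failure message points at the likely root cause: the tripleo.scale.delete_node action appears to pass the --stack value straight through as the name of the deployment plan's Mistral environment, which is keyed by stack name, not UUID. A quick way to check this from the undercloud (a hedged sketch; it assumes python-mistralclient is installed, as it is on a director node, and uses my deployment's names):

  source ~/stackrc
  # The plan environment should be listed under the stack *name*:
  mistral environment-list
  mistral environment-get overcloud                              # expected to succeed
  mistral environment-get 23e7c364-7303-4af6-b54d-cfbf1b737680   # expected: Environment not found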
No errors were reported by `openstack stack failures list overcloud`:

[stack@hci-director ~]$ openstack stack failures list overcloud
[stack@hci-director ~]$ openstack stack list
+--------------------------------------+------------+-----------------+----------------------+--------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time |
+--------------------------------------+------------+-----------------+----------------------+--------------+
| 23e7c364-7303-4af6-b54d-cfbf1b737680 | overcloud  | CREATE_COMPLETE | 2016-11-24T03:24:56Z | None         |
+--------------------------------------+------------+-----------------+----------------------+--------------+
[stack@hci-director ~]$ openstack server list
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| ID                                   | Name                    | Status | Networks              | Image Name     |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| fc8686c1-a675-4c89-a508-cc1b34d5d220 | overcloud-controller-2  | ACTIVE | ctlplane=192.168.1.37 | overcloud-full |
| 7c6ae5f3-7e18-4aa2-a1f8-53145647a3de | overcloud-osd-compute-2 | ACTIVE | ctlplane=192.168.1.30 | overcloud-full |
| 5fa641cf-b290-4a2a-b15e-494ab9d10d8a | overcloud-osd-compute-3 | ACTIVE | ctlplane=192.168.1.21 | overcloud-full |
| 851f76db-427c-42b3-8e0b-e8b4b19770f8 | overcloud-controller-0  | ACTIVE | ctlplane=192.168.1.33 | overcloud-full |
| e2906507-6a06-4c4d-bd15-9f7de455e91d | overcloud-controller-1  | ACTIVE | ctlplane=192.168.1.29 | overcloud-full |
| 0f93a712-b9eb-4f42-bc05-f2c8c2edfd81 | overcloud-osd-compute-0 | ACTIVE | ctlplane=192.168.1.32 | overcloud-full |
| 8f266c17-ff39-422e-a935-effb219c7782 | overcloud-osd-compute-1 | ACTIVE | ctlplane=192.168.1.24 | overcloud-full |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
[stack@hci-director ~]$

Our staged documentation:

https://access.redhat.com/documentation/en/red-hat-openstack-platform/10-beta/single/director-installation-and-usage/#sect-Removing_Compute_Nodes

suggests running:

  openstack overcloud node delete --stack [STACK_UUID] --templates -e [ENVIRONMENT_FILE] [NODE1_UUID] [NODE2_UUID] [NODE3_UUID]

and adds: "If you passed any extra environment files when you created the Overcloud, pass them here again using the -e or --environment-file option to avoid making undesired manual changes to the Overcloud."

Does -r apply in the same way? If not, the docs will probably need an update too.
I was able to delete my extra node by updating the Heat templates to shrink the node count and then re-running the deploy. Is my issue really just a docbug? Can someone confirm that this is the recommended way to delete an overcloud node? If so, this BZ can be changed to a docbug.

Details: I had originally set my OsdComputeCount to 4 and had pre-assigned IPs for the 4th node (overcloud-osd-compute-3). I then decremented the count and commented out the extra IPs:

[stack@hci-director ~]$ egrep "\#|OsdComputeCount" ~/custom-templates/layout.yaml
  OsdComputeCount: 3
  #- 192.168.2.206
  #- 192.168.3.206
  #- 172.16.1.206
  #- 172.16.2.206
[stack@hci-director ~]$

From there I re-ran my deploy script [1] (a consolidated sketch follows the output below), after which I had the desired behavior. Because the node to be deleted was a Ceph OSD, I had first followed our procedure to manually remove its OSDs from the Ceph cluster, as per our doc [2]. The same doc will also need an update, since it suggests using `openstack overcloud node delete`. Here [3] are details on my overcloud after running the deploy again with the node count changed.

[1] https://github.com/RHsyseng/hci/blob/master/scripts/deploy.sh
[2] https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/red-hat-ceph-storage-for-the-overcloud#Replacing_Ceph_Storage_Nodes
[3]
[stack@hci-director ~]$ openstack server list
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| ID                                   | Name                    | Status | Networks              | Image Name     |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| fc8686c1-a675-4c89-a508-cc1b34d5d220 | overcloud-controller-2  | ACTIVE | ctlplane=192.168.1.37 | overcloud-full |
| 7c6ae5f3-7e18-4aa2-a1f8-53145647a3de | overcloud-osd-compute-2 | ACTIVE | ctlplane=192.168.1.30 | overcloud-full |
| 851f76db-427c-42b3-8e0b-e8b4b19770f8 | overcloud-controller-0  | ACTIVE | ctlplane=192.168.1.33 | overcloud-full |
| e2906507-6a06-4c4d-bd15-9f7de455e91d | overcloud-controller-1  | ACTIVE | ctlplane=192.168.1.29 | overcloud-full |
| 0f93a712-b9eb-4f42-bc05-f2c8c2edfd81 | overcloud-osd-compute-0 | ACTIVE | ctlplane=192.168.1.32 | overcloud-full |
| 8f266c17-ff39-422e-a935-effb219c7782 | overcloud-osd-compute-1 | ACTIVE | ctlplane=192.168.1.24 | overcloud-full |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
[stack@hci-director ~]$
[stack@hci-director ~]$ openstack stack list
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| 23e7c364-7303-4af6-b54d-cfbf1b737680 | overcloud  | UPDATE_COMPLETE | 2016-11-24T03:24:56Z | 2016-11-29T04:47:03Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
[stack@hci-director ~]$
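For reference, here is the workaround in one place, as a minimal sketch assuming my template layout (the sed edit mirrors the manual change above; the pre-assigned IPs for the removed node must also be commented out in layout.yaml, as shown in the egrep output):

  # Workaround sketch: scale down by shrinking the role count and re-deploying
  # with the exact same role and environment files used for the original deploy.
  sed -i 's/OsdComputeCount: 4/OsdComputeCount: 3/' ~/custom-templates/layout.yaml
  openstack overcloud deploy --templates \
    -r ~/custom-templates/custom-roles.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
    -e ~/custom-templates/network.yaml \
    -e ~/custom-templates/ceph.yaml \
    -e ~/custom-templates/layout.yaml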
As per shardy, the following is the correct way to delete a node, even if it's from a custom role:

  openstack overcloud node delete --stack $ID $node_id

The above should be run without any -r or -e options. I will test this next and update the bug.
I made a mistake in my testing with the IDs; comments 3, 4, and 5 should be ignored.
I was able to delete a node, but I had to provide the stack name, "overcloud", and not the UUID. Here's an example of it working:

1. Identify the node ID:

[stack@hci-director ~]$ openstack server list | grep osd-compute-3
| 6b2a2e71-f9c8-4d5b-aaf8-dada97c90821 | overcloud-osd-compute-3 | ACTIVE | ctlplane=192.168.1.27 | overcloud-full |
[stack@hci-director ~]$

2. Start a Mistral workflow to delete the node by ID from the stack by name:

[stack@hci-director ~]$ time openstack overcloud node delete --stack overcloud 6b2a2e71-f9c8-4d5b-aaf8-dada97c90821
deleting nodes [u'6b2a2e71-f9c8-4d5b-aaf8-dada97c90821'] from stack overcloud
Started Mistral Workflow. Execution ID: 396f123d-df5b-4f37-b137-83d33969b52b

real    1m50.662s
user    0m0.563s
sys     0m0.099s
[stack@hci-director ~]$

3. Observe that the stack is being updated:

[stack@hci-director ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| id                                   | stack_name | stack_status       | creation_time        | updated_time         |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
| 23e7c364-7303-4af6-b54d-cfbf1b737680 | overcloud  | UPDATE_IN_PROGRESS | 2016-11-24T03:24:56Z | 2016-11-30T17:16:48Z |
+--------------------------------------+------------+--------------------+----------------------+----------------------+
[stack@hci-director ~]$

4. Observe that the update is complete:

[stack@hci-director ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| 23e7c364-7303-4af6-b54d-cfbf1b737680 | overcloud  | UPDATE_COMPLETE | 2016-11-24T03:24:56Z | 2016-11-30T17:16:48Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
[stack@hci-director ~]$

5. Observe that the node was deleted as desired:
[stack@hci-director ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks              |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
| 851f76db-427c-42b3-8e0b-e8b4b19770f8 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.1.33 |
| e2906507-6a06-4c4d-bd15-9f7de455e91d | overcloud-controller-1  | ACTIVE | -          | Running     | ctlplane=192.168.1.29 |
| fc8686c1-a675-4c89-a508-cc1b34d5d220 | overcloud-controller-2  | ACTIVE | -          | Running     | ctlplane=192.168.1.37 |
| 0f93a712-b9eb-4f42-bc05-f2c8c2edfd81 | overcloud-osd-compute-0 | ACTIVE | -          | Running     | ctlplane=192.168.1.32 |
| 8f266c17-ff39-422e-a935-effb219c7782 | overcloud-osd-compute-1 | ACTIVE | -          | Running     | ctlplane=192.168.1.24 |
| 7c6ae5f3-7e18-4aa2-a1f8-53145647a3de | overcloud-osd-compute-2 | ACTIVE | -          | Running     | ctlplane=192.168.1.30 |
+--------------------------------------+-------------------------+--------+------------+-------------+-----------------------+
[stack@hci-director ~]$

Warning: if you identify the stack by its UUID, as I did originally, you may run into the issue below. Note in the first line of output from the command below that it correctly identifies the node ID and the stack ID, but the workflow is unable to find the environment by that name: "Environment not found [name=23e7c364-7303-4af6-b54d-cfbf1b737680]". So I think this is a minor bug, and I'll update the title, since the workaround is simple.

[stack@hci-director ~]$ nova_id=$(openstack server list | grep compute-3 | awk {'print $2'} | egrep -vi 'id|^$')
[stack@hci-director ~]$ stack_id=$(openstack stack list | awk {'print $2'} | egrep -vi 'id|^$')
[stack@hci-director ~]$ time openstack overcloud node delete --stack $stack_id $nova_id
deleting nodes [u'6b2a2e71-f9c8-4d5b-aaf8-dada97c90821'] from stack 23e7c364-7303-4af6-b54d-cfbf1b737680
Started Mistral Workflow.
Execution ID: 4864b1df-a170-4d51-b411-79f839d11ecd
{u'execution': {u'id': u'4864b1df-a170-4d51-b411-79f839d11ecd',
                u'input': {u'container': u'23e7c364-7303-4af6-b54d-cfbf1b737680',
                           u'nodes': [u'6b2a2e71-f9c8-4d5b-aaf8-dada97c90821'],
                           u'queue_name': u'b0c40c06-be37-402d-9636-6071ba3e28b2',
                           u'timeout': 240},
                u'name': u'tripleo.scale.v1.delete_node',
                u'params': {},
                u'spec': {u'description': u'deletes given overcloud nodes and updates the stack',
                          u'input': [u'container', u'nodes', {u'timeout': 240}, {u'queue_name': u'tripleo'}],
                          u'name': u'delete_node',
                          u'tasks': {u'delete_node': {u'action': u'tripleo.scale.delete_node nodes=<% $.nodes %> timeout=<% $.timeout %> container=<% $.container %>',
                                                      u'name': u'delete_node',
                                                      u'on-error': u'set_delete_node_failed',
                                                      u'on-success': u'send_message',
                                                      u'type': u'direct',
                                                      u'version': u'2.0'},
                                     u'send_message': {u'action': u'zaqar.queue_post',
                                                       u'input': {u'messages': {u'body': {u'payload': {u'execution': u'<% execution() %>',
                                                                                                       u'message': u"<% $.get('message', '') %>",
                                                                                                       u'status': u"<% $.get('status', 'SUCCESS') %>"},
                                                                                          u'type': u'tripleo.scale.v1.delete_node'}},
                                                                  u'queue_name': u'<% $.queue_name %>'},
                                                       u'name': u'send_message',
                                                       u'retry': u'count=5 delay=1',
                                                       u'type': u'direct',
                                                       u'version': u'2.0'},
                                     u'set_delete_node_failed': {u'name': u'set_delete_node_failed',
                                                                 u'on-success': u'send_message',
                                                                 u'publish': {u'message': u'<% task(delete_node).result %>',
                                                                              u'status': u'FAILED'},
                                                                 u'type': u'direct',
                                                                 u'version': u'2.0'}},
                          u'version': u'2.0'}},
 u'message': u"Failed to run action [action_ex_id=c2e44ffe-00fc-4131-b29c-981e33f50ea1, action_cls='<class 'mistral.actions.action_factory.ScaleDownAction'>', attributes='{}', params='{u'nodes': [u'6b2a2e71-f9c8-4d5b-aaf8-dada97c90821'], u'container': u'23e7c364-7303-4af6-b54d-cfbf1b737680', u'timeout': 240}']\n Environment not found [name=23e7c364-7303-4af6-b54d-cfbf1b737680]",
 u'status': u'FAILED'}

real    1m39.169s
user    0m0.530s
sys     0m0.104s
[stack@hci-director ~]$
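Until the fix is available, the workaround is simply to resolve the stack name rather than the UUID. A hedged variant of my shell snippet above (assumes the `-f value -c <column>` output options of the openstack client, which are available in this release):

  # Pass the stack *name* to --stack; a UUID triggers the failure above.
  stack_name=$(openstack stack list -f value -c 'Stack Name')
  nova_id=$(openstack server list -f value -c ID -c Name | awk '/osd-compute-3/ {print $1}')
  openstack overcloud node delete --stack "$stack_name" "$nova_id"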
Because this is a duplicate of the following upstream bug, which already has a fix released in Ocata, I am marking this BZ as MODIFIED.

https://bugs.launchpad.net/tripleo/+bug/1640933

Here is the fix from Ocata:

https://review.openstack.org/#/c/398289/

If this is backported to Newton, then we could identify the fixed-in version and set the BZ to POST as a next step.
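Once a build carrying the backport is available, a quick way to verify would be (a sketch; the fixed-in NVR of openstack-tripleo-common has not been identified yet, so the package check is an assumption):

  # Confirm the updated package, then re-test the originally failing form.
  rpm -q openstack-tripleo-common
  stack_id=$(openstack stack list -f value -c ID)
  nova_id=$(openstack server list -f value -c ID -c Name | awk '/osd-compute/ {print $1; exit}')
  openstack overcloud node delete --stack "$stack_id" "$nova_id"   # should succeed with the fix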
Patch landed in stable/newton.
Targeting this BZ to OSP 10 since the fix is in a recent build.
No release notes required for this bug fix. Flags set.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0234.html