Bug 1857365

Summary: scale down of nodes is failing if all nodes are unreachable
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Reporter: Alex Schultz <aschultz>
Assignee: Emilien Macchi <emacchi>
QA Contact: David Rosenfeld <drosenfe>
CC: emacchi, mburns, psahoo, scohen, smalleni, spower
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Triaged
Target Milestone: z1
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200616081533.396affd
Doc Type: No Doc Update
Last Closed: 2020-08-27 15:19:10 UTC
Type: Bug

Description Alex Schultz 2020-07-15 17:36:39 UTC
Description of problem:
While troubleshooting Bug 1857298, we identified several issues with the scale down playbook: the common playbook is not properly skipped when all of the targeted nodes are unavailable; the common playbook assumes that every targeted node will be reachable, which may not be the case during a scale down; and the dynamic any_errors_fatal setting does not appear to be honored.
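
For illustration, a minimal Ansible sketch of the kind of guard this calls for (the play, task, and unit names are assumptions, not the actual tripleo-heat-templates content): per-node tasks need ignore_unreachable so that a powered-off host is recorded as unreachable instead of aborting the play.

- name: Scale down tasks that must tolerate down nodes (illustrative)
  hosts: compute
  gather_facts: false
  any_errors_fatal: false           # the bug notes a dynamic value here was not honored
  tasks:
    - name: Stop nova-compute container
      ansible.builtin.systemd:
        name: tripleo_nova_compute  # assumed systemd unit name
        state: stopped
      become: true
      ignore_unreachable: true      # record the host as unreachable and keep going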

Version-Release number of selected component (if applicable):
python3-tripleo-common-11.3.3-0.20200611110655.f7715be.el8ost.noarch
openstack-tripleo-common-11.3.3-0.20200611110655.f7715be.el8ost.noarch
openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch
ansible-tripleo-ipa-0.2.1-0.20200611104546.c22fc8d.el8ost.noarch
ansible-tripleo-ipsec-9.2.1-0.20200311073016.0c8693c.el8ost.noarch
puppet-tripleo-11.5.0-0.20200616033427.8ff1c6a.el8ost.noarch
openstack-tripleo-puppet-elements-11.2.2-0.20200527003426.226ce95.el8ost.noarch
python3-tripleoclient-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
ansible-role-tripleo-modify-image-1.2.1-0.20200527233426.bc21900.el8ost.noarch
openstack-tripleo-common-containers-11.3.3-0.20200611110655.f7715be.el8ost.noarch
openstack-tripleo-heat-templates-11.3.2-0.20200616081529.396affd.el8ost.noarch
tripleo-ansible-0.5.1-0.20200611113655.34b8fcc.el8ost.noarch
python3-tripleoclient-heat-installer-12.3.2-0.20200615103427.6f877f6.el8ost.noarch
openstack-tripleo-image-elements-10.6.2-0.20200528043425.7dc0fa1.el8ost.noarch

How reproducible:
Reproducible when all nodes being scaled down are unavailable.

Steps to Reproduce:
1. deploy overcloud
2. turn off compute node
3. attempt to scale down compute node
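
Step 3 is typically driven with tripleoclient, along these lines (the node name is illustrative and exact flags can vary by release):

  openstack overcloud node delete --stack overcloud compute-1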

Actual results:
The scale down action fails during execution.

Expected results:
Unreachable nodes should be ignored and the scale down should complete.


Additional info:

Comment 1 Alex Schultz 2020-07-17 14:00:16 UTC
*** Bug 1857004 has been marked as a duplicate of this bug. ***

Comment 3 spower 2020-07-22 09:41:54 UTC
Removing the Blocker flag; this has already been approved for 16.1.1.

Comment 6 David Rosenfeld 2020-07-30 16:47:26 UTC
Had a deployment with two compute nodes. Shut down both nodes. Deletion of both nodes was successful:

TASK [Stop nova-compute healthcheck container] *********************************
Thursday 30 July 2020  12:07:41 -0400 (0:00:04.201)       0:02:50.683 ********* 
fatal: [compute-1]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"192.168.24.30\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.30 port 22: No route to host\r\n", "skip_reason": "Host compute-1 is unreachable", "unreachable": true}

fatal: [compute-2]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"192.168.24.54\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.54 port 22: No route to host\r\n", "skip_reason": "Host compute-2 is unreachable", "unreachable": true}

TASK [Stop nova-compute container] *********************************************
Thursday 30 July 2020  12:10:01 -0400 (0:02:20.489)       0:05:11.173 ********* 
fatal: [compute-2]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"192.168.24.54\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.54 port 22: No route to host\r\n", "skip_reason": "Host compute-2 is unreachable", "unreachable": true}

fatal: [compute-1]: UNREACHABLE! => {"changed": false, "msg": "Data could not be sent to remote host \"192.168.24.30\". Make sure this host can be reached over ssh: ssh: connect to host 192.168.24.30 port 22: No route to host\r\n", "skip_reason": "Host compute-1 is unreachable", "unreachable": true}

TASK [Delete nova-compute service] *********************************************
Thursday 30 July 2020  12:12:21 -0400 (0:02:19.815)       0:07:30.989 ********* 
changed: [compute-2]
changed: [compute-1]

TASK [fail] ********************************************************************
Thursday 30 July 2020  12:12:26 -0400 (0:00:05.145)       0:07:36.134 ********* 
skipping: [compute-1]
skipping: [compute-2]

PLAY RECAP *********************************************************************
compute-1                  : ok=9    changed=2    unreachable=3    failed=0    skipped=5    rescued=0    ignored=0   
compute-2                  : ok=8    changed=2    unreachable=3    failed=0    skipped=5    rescued=0    ignored=0   

Thursday 30 July 2020  12:12:26 -0400 (0:00:00.110)       0:07:36.245 ********* 
=============================================================================== 

Ansible passed.


Prior to the fix, the same test failed with: Ansible failed, check log at /var/lib/mistral/overcloud/ansible.log.

Comment 9 errata-xmlrpc 2020-08-27 15:19:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3542

Comment 10 Alex Schultz 2020-09-09 13:22:56 UTC
*** Bug 1856922 has been marked as a duplicate of this bug. ***