Bug 1658331

Summary: Overcloud deployment fails not changing provisioning state or powering off baremetal servers
Product: Red Hat OpenStack
Reporter: bjacot
Component: openstack-ironic
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG
QA Contact: bjacot
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: bfournie, dtantsur, jkreger, mburns
Last Closed: 2019-01-03 13:46:04 UTC
Type: Bug

Description bjacot 2018-12-11 19:02:10 UTC
Description of problem:
When an overcloud (OC) deployment fails, the Provisioning State and Power State shown by "openstack baremetal node list" are not updated. In my test the deployment failed with "DeploymentError: Heat Stack create failed.", but the baremetal nodes still showed power state "power on" and provisioning state "deploying".

As a result, the end user cannot delete the failed OC.

Version-Release number of selected component (if applicable):

version: core_puddle_version 2018-12-07.2
rpm -qa | grep openstack-ironic
openstack-ironic-common-11.1.1-0.20181012152841.el7ost.noarch

How reproducible:
Intermittent (seen on some deployments)

Steps to Reproduce:
Note: These OC deployment steps belong to another task, but they led me to a failed OC deployment.
1. Deploy the undercloud (UC).
2. Deploy the Ironic nodes and introspect them.
3. Follow these steps:
   http://tripleo.org/install/advanced_deployment/ansible_deploy_interface.html
   A: Custom ansible playbooks: steps 1-2
   B: Installing/updating the UC: steps 1-5
   Note: in step 5, run: "sudo chmod 777 /var/lib/ironic/ipa-ssh"
   C: Skip "Enabling Temporary URLs"; it is not needed for OSP14
   D: Configure Nodes
   E: Editing Playbooks: steps 1-2
   Note: in step 1, change dest: "{{ tmp_rootfs_mount }}/etc/default/grub" to path: "{{ tmp_rootfs_mount }}/etc/default/grub" in grub.yaml
4. Prepare for any failed OC deployment.
5. Run the OC deployment; it will fail.

Actual results:
  File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/deployment.py", line 106, in deploy_and_wait
    raise exceptions.DeploymentError("Heat Stack create failed.")
DeploymentError: Heat Stack create failed.

END return value: 1
(undercloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| f3ded335-a9b7-4aa4-8117-caebc67b9a66 | compute-0    | 0b54a343-8d71-44f7-a204-4f22a70c895b | power on    | deploying          | False       |
| 65d42111-a299-4bb8-8e12-33fba1d513a9 | controller-0 | 665b8f4a-dacf-4176-b138-e6035435fff7 | power on    | deploying          | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
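
The stuck state in the table above can be picked out programmatically. The following is a minimal sketch, not part of the original report: it parses the text output of "openstack baremetal node list" and lists nodes whose Provisioning State is still "deploying". The function names parse_node_table and stuck_nodes are hypothetical helpers introduced for illustration.

```python
def parse_node_table(text):
    """Parse the ASCII table printed by `openstack baremetal node list`.

    Returns one dict per node, keyed by the column headers.
    """
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +-----+ border lines and anything else
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first data line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stuck_nodes(rows, state="deploying"):
    """Return the names of nodes whose provisioning state equals `state`."""
    return [row["Name"] for row in rows if row["Provisioning State"] == state]


# Table copied from the output in this report.
SAMPLE = """
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name         | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
| f3ded335-a9b7-4aa4-8117-caebc67b9a66 | compute-0    | 0b54a343-8d71-44f7-a204-4f22a70c895b | power on    | deploying          | False       |
| 65d42111-a299-4bb8-8e12-33fba1d513a9 | controller-0 | 665b8f4a-dacf-4176-b138-e6035435fff7 | power on    | deploying          | False       |
+--------------------------------------+--------------+--------------------------------------+-------------+--------------------+-------------+
"""

print(stuck_nodes(parse_node_table(SAMPLE)))
```

Run against the sample, this prints both node names, since both are still in "deploying" after the Heat stack create failed.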

Expected results:
The baremetal nodes should not remain in the "deploying" state after the OC deployment fails.

Additional info:

No workaround.

Comment 2 Bob Fournier 2018-12-11 22:32:57 UTC
Seeing these deploy failures in ironic-conductor.log:

[conductor] ***************************************************************\nMETA: ran handlers\n\nTASK [add_host] ****************************************************************\ntask path: /var/lib/ironic/playbooks/add-ironic-nodes.yaml:4\ncreating host via \'add_host\': hostname=65d42111-a299-4bb8-8e12-33fba1d513a9\nchanged: [conductor] => (item={u\'ip\': u\'192.168.24.18\', u\'user\': u\'root\', u\'name\': u\'65d42111-a299-4bb8-8e12-33fba1d513a9\', u\'extra\': {u\'hardware_swift_object\': u\'extra_hardware-65d42111-a299-4bb8-8e12-33fba1d513a9\'}}) => {\n    "add_host": {\n        "groups": [\n            "ironic"\n        ], \n        "host_name": "65d42111-a299-4bb8-8e12-33fba1d513a9", \n        "host_vars": {\n            "ansible_host": "192.168.24.18", \n            "ansible_user": "root", \n            "group": "ironic", \n            "ironic_extra": {\n                "hardware_swift_object": "extra_hardware-65d42111-a299-4bb8-8e12-33fba1d513a9"\n            }\n        }\n    }, \n    "changed": true, \n    "item": {\n        "extra": {\n            "hardware_swift_object": "extra_hardware-65d42111-a299-4bb8-8e12-33fba1d513a9"\n        }, \n        "ip": "192.168.24.18", \n        "name": "65d42111-a299-4bb8-8e12-33fba1d513a9", \n        "user": "root"\n    }\n}\nMETA: ran handlers\nMETA: ran handlers\n\nPLAY [ironic] ******************************************************************\n\nTASK [Gathering Facts] *********************************************************\ntask path: /var/lib/ironic/playbooks/deploy.yaml:4\nUsing module file /usr/lib/python2.7/site-packages/ansible/modules/system/setup.py\n<192.168.24.18> ESTABLISH CONNECTION FOR USER: root on PORT 22 TO 192.168.24.18\nfatal: [65d42111-a299-4bb8-8e12-33fba1d513a9]: UNREACHABLE! 
=> {\n    "changed": false, \n    "msg": "[Errno 13] Permission denied: u\'/var/lib/ironic/ipa-ssh\'", \n    "unreachable": true\n}\n\nPLAY RECAP *********************************************************************\n65d42111-a299-4bb8-8e12-33fba1d513a9 : ok=0    changed=0    unreachable=1    failed=0   \nconductor                  : ok=1    changed=1    unreachable=0    failed=0   \n\n'
Stderr: u' [WARNING]: Ignoring invalid attribute: state\n [WARNING]: Ignoring invalid attribute: path\n [WARNING]: Ignoring invalid attribute: line\n'
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor Traceback (most recent call last):
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/agent_base_vendor.py", line 310, in heartbeat
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor     self.continue_deploy(task)
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic_lib/metrics.py", line 60, in wrapped
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor     result = f(*args, **kwargs)
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/ansible/deploy.py", line 564, in continue_deploy
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor     self._ansible_deploy(task, node_address)
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/ansible/deploy.py", line 428, in _ansible_deploy
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor     _run_playbook(node, playbook, extra_vars, key)
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor   File "/usr/lib/python2.7/site-packages/ironic/drivers/modules/ansible/deploy.py", line 160, in _run_playbook
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor     raise exception.InstanceDeployFailure(reason=e)
2018-12-11 10:24:01.867 1 ERROR ironic.drivers.modules.agent_base_vendor InstanceDeployFailure: Failed to deploy instance: Unexpected error while running command.
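
The "[Errno 13] Permission denied" on /var/lib/ironic/ipa-ssh in the log above means the user running the playbook could not read that path. As a rough sketch (an assumption about how one might check this, not something from the original logs), the check below reports whether the current user can read a given path. The helper name key_is_readable is hypothetical, and the result is only meaningful when run as the same user, and in the same container, as ironic-conductor.

```python
import os


def key_is_readable(path):
    """Return True if the current user can read `path`.

    os.access returns False for a missing path as well as for one
    that exists but is not readable by this user.
    """
    return os.access(path, os.R_OK)


if __name__ == "__main__":
    key = "/var/lib/ironic/ipa-ssh"  # path taken from the log above
    status = "readable" if key_is_readable(key) else "NOT readable by this user"
    print(key, status)
```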

Comment 3 bjacot 2018-12-12 16:11:51 UTC
I suspect that when I change the grub.yaml file, set the --extra param, and run the OC deployment, the change prevents the Ironic node from switching from "deploying" to "active", so the deployment times out. When I run the deployment without any changes to grub.yaml, the state changes correctly throughout the OC deployment.

nova-compute.log
2018-12-11 18:53:05.217 1 DEBUG nova.virt.ironic.driver [-] [instance: 428de4e5-bfd1-42bf-8f21-9236ab1351bb] Still waiting for ironic node 5fb90416-7382-4418-94c2-7e07e8d264be to become ACTIVE: power_state="power on", target_power_state=None, provision_state="deploying", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:131

2018-12-11 18:53:06.376 1 DEBUG nova.virt.ironic.driver [-] [instance: 4f548d20-0d3a-43a6-9c87-f5b9108ef9fe] Still waiting for ironic node 6c1d076a-8310-4136-b6a1-0b32fdeca417 to become ACTIVE: power_state="power on", target_power_state=None, provision_state="deploying", target_provision_state="active" _log_ironic_polling /usr/lib/python2.7/site-packages/nova/virt/ironic/driver.py:131

Comment 4 bjacot 2018-12-12 18:03:44 UTC
Workaround:
From the director node, issue these commands:
#sudo docker restart ironic_conductor
#source stackrc && openstack baremetal node undeploy controller-0
#source stackrc && openstack baremetal node undeploy compute-0

Comment 5 bjacot 2018-12-13 14:47:51 UTC
Another, faster workaround:
#sudo docker restart ironic_conductor
#source stackrc && openstack stack delete -y overcloud

Comment 6 bjacot 2018-12-17 13:45:40 UTC
Update on how to trigger the failure: in step 3 "E", the playbook must be edited with incorrect spacing. This triggers the failure and causes the OC deployment to fail.
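
The "Ignoring invalid attribute: state / path / line" warnings in Comment 2 are consistent with module arguments ending up at the task level instead of under the module because of wrong spacing. The sketch below illustrates that difference on already-parsed tasks (plain dicts, as a YAML loader would return them). It is an illustration under stated assumptions: TASK_KEYWORDS is a small hypothetical subset of legitimate task-level keys, misplaced_args is a made-up helper, and the "line" value is a placeholder rather than the real grub.yaml content.

```python
# Hypothetical, incomplete set of keys that are legitimate at the task level.
TASK_KEYWORDS = {"name", "when", "with_items", "become", "tags", "register", "vars"}


def misplaced_args(task, module):
    """Return keys that sit beside `module` in the task instead of under it."""
    return sorted(
        key for key in task if key != module and key not in TASK_KEYWORDS
    )


# Correct spacing: the arguments are nested under the module.
good_task = {
    "name": "edit grub defaults",
    "lineinfile": {
        "path": "{{ tmp_rootfs_mount }}/etc/default/grub",
        "state": "present",
        "line": "EXAMPLE_SETTING=value",  # placeholder, not the real setting
    },
}

# Incorrect spacing: the same arguments end up as task-level attributes.
bad_task = {
    "name": "edit grub defaults",
    "lineinfile": None,
    "path": "{{ tmp_rootfs_mount }}/etc/default/grub",
    "state": "present",
    "line": "EXAMPLE_SETTING=value",  # placeholder, not the real setting
}

print(misplaced_args(good_task, "lineinfile"))
print(misplaced_args(bad_task, "lineinfile"))
```

The first call reports nothing misplaced; the second reports line, path and state, the same three attributes named in the warnings.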

Comment 7 Bob Fournier 2019-01-03 13:46:04 UTC
Per Comment 6, this is caused by an incorrect grub setting in the playbook; it is not a bug, and not something that can be detected. Closing.