Description of problem:

This BZ is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1432571 in order to precisely track an issue that we were able to reproduce internally as well.

Namely, when triggering an overcloud minor update with:

openstack overcloud update stack -i overcloud

the tripleoclient is stuck in IN_PROGRESS and times out after the default 4h timeout, even though the update step went through on the overcloud node.

Reproduced with:

1) made the stack fail as described in https://bugzilla.redhat.com/show_bug.cgi?id=1416228

Note: right now it is not known whether a stack in a failed update state is needed, but these are the steps that led to successfully reproducing the issue.

[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: 6c981c34-8d4d-4761-9a16-08e3d789b527
Plan updated
Deploying templates in the directory /tmp/tripleoclient-ERSxeZ/tripleo-heat-templates
Started Mistral Workflow. Execution ID: a889c346-2e4a-4dfe-9409-36e91b1d8773
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear d0af588b-2ed2-46e8-89bd-466111526b8b), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.

=> now we have the overcloud stack in the update failed state, which is what we wanted

2) fixed yum_update.sh again so that it no longer fails on the compute node when run

[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: e0ac2916-d440-4419-8647-0578eeaf5084
Plan updated
Deploying templates in the directory /tmp/tripleoclient-kyRZCA/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 789cf917-e326-4dca-95d5-742ff136550c
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed

3) when we now trigger an overcloud update, we clear the 1st breakpoint, which is on the node where we previously failed:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1', u'controller-2', u'controller-0']
on_breakpoint: [u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 99cdfbce-30bd-4c73-8055-050a0af48e56), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...

From the logs of the compute service we see that the update on the compute node went through, and also that the compute node signaled success back to heat-cfn.

An issue where hooks aren't retrieved properly in heat-client has already been identified in https://bugzilla.redhat.com/show_bug.cgi?id=1436712

~~~
We used hooks on the UpdateDeployment resources while making a minor update. The first and/or second hooks generally worked fine, but we were waiting forever on the last 2. The heat hook-poll -n5 overcloud command didn't return anything.

It turns out the client detection of hooks is broken. We don't set the stack_name of the event correctly, and as we use the stack_name to identify the event, we can't detect the hooks correctly.

This affects the heat command line client (openstack stack hook poll / heat hook-poll).
~~~

While the tripleoclient was stuck in IN_PROGRESS, we were able to move the update forward by manually clearing the hook for the next OS::TripleO::Controller resource_type once the logs showed that the update had gone through, with a command like:

openstack stack hook clear --pre-update 5045a96a-3399-4491-9961-d26e5fc93830 UpdateDeployment

After the update went through all nodes, the tripleoclient ended with update complete:

...
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE

It would seem that tripleoclient fails to detect pending hooks as well. But its implementation is completely different from the heatclient one, so the source of the bug ought to be different as well. It's also worth noting that the server side seems fine in that respect.

Version-Release number of selected component (if applicable):
python-tripleoclient-5.4.1-1.el7ost.noarch
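To make the heat-client problem quoted from bug 1436712 above more concrete, here is a minimal sketch of how losing the stack_name on events can hide pending hooks. This is only a model of the idea and not the actual heatclient code: the function pending_hooks, the event dict layout, the nested stack name "overcloud-Compute-0" and the two reason strings are assumptions made for illustration.

```python
# Hypothetical model of hook detection; NOT the actual heatclient code.
# The reason strings below are assumed for illustration.
HOOK_PAUSED = "UPDATE paused until Hook pre-update is cleared"
HOOK_CLEARED = "Hook pre-update is cleared"


def pending_hooks(events, stack_name):
    """Return resources in stack_name that paused on a hook and were not cleared.

    events is a chronologically ordered list of dicts (assumed layout).
    """
    pending = set()
    for ev in events:
        # Events are identified by stack_name; an event without the
        # right stack_name is never considered.
        if ev.get("stack_name") != stack_name:
            continue
        reason = ev.get("resource_status_reason")
        if reason == HOOK_PAUSED:
            pending.add(ev["resource_name"])
        elif reason == HOOK_CLEARED:
            pending.discard(ev["resource_name"])
    return sorted(pending)


# One paused hook in a hypothetical nested stack.
good = [{
    "stack_name": "overcloud-Compute-0",
    "resource_name": "UpdateDeployment",
    "resource_status_reason": HOOK_PAUSED,
}]
# Same event, but with the stack_name not set.
broken = [dict(ev, stack_name=None) for ev in good]

print(pending_hooks(good, "overcloud-Compute-0"))    # ['UpdateDeployment']
print(pending_hooks(broken, "overcloud-Compute-0"))  # []
```

In this model the hook is still pending on the server side in both cases; only the client-side lookup comes back empty, which would match a poll that "didn't return anything".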
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
It looks like the main logic behind this lives in tripleo-common:

https://github.com/openstack/python-tripleoclient/blob/stable/newton/tripleoclient/v1/overcloud_update.py
https://github.com/openstack/tripleo-common/blob/stable/newton/tripleo_common/update.py
https://github.com/openstack/tripleo-common/blob/stable/newton/tripleo_common/_stack_update.py
Martin, thank you for the environment information. There are multiple people connected to it and a stack update in progress, so I assume it is used as part of other bugs as well? Despite my best efforts I've been unable to reproduce the bug exactly so it's difficult for me to confirm if the patch upstream will fix this particular case. Is it possible to apply the patch in your lab or another test environment where the issue has been confirmed?
1st run:

* put stack in failed state as mentioned in description

[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: ec001f0d-49b7-4afb-b3cf-e4d9a3a5f287
Plan updated
Deploying templates in the directory /tmp/tripleoclient-vj1ILn/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 4b506ad6-f6ae-4396-816d-d8fbe9f9c0b0
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed

[stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+---------------+----------------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| 0add0f72-8693-424b-bf28-06b11402340d | overcloud  | UPDATE_FAILED | 2017-03-18T23:18:30Z | 2017-03-31T11:10:57Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+

* applied https://review.openstack.org/#/c/451725/3/tripleo_common/_stack_update.py

[stack@undercloud-0 ~]$ diff -u _stack_update.py _stack_update.py-fix
--- _stack_update.py	2017-03-31 11:50:01.356143531 +0000
+++ _stack_update.py-fix	2017-03-31 11:03:14.169127182 +0000
@@ -160,9 +160,9 @@
                 state = 'on_breakpoint'
             elif ev.resource_status_reason == hook_clear_reason:
                 state = 'in_progress'
-            elif ev.resource_status == 'UPDATE_IN_PROGRESS':
+            elif ev.resource_status in ('CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS'):
                 state = 'in_progress'
-            elif ev.resource_status == 'UPDATE_COMPLETE':
+            elif ev.resource_status in ('CREATE_COMPLETE', 'UPDATE_COMPLETE'):
                 state = 'completed'
 
             resources[state][res.physical_resource_id] = res

* update was successful:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1']
on_breakpoint: [u'compute-0', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0']
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: controller-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 8963c6f9-ac10-4937-adc7-62114739a845), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'controller-2', u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 0e4c9349-9a54-4371-8fc6-f4f0b9428744), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE

* a second test run was also successful

* in a 3rd run I reverted the patch and the update is stuck again:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear c820338e-79c5-4f13-8a13-0646911d07a9), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...
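For anyone reading along, the effect of the patch tested in this comment can be sketched as a standalone function. This is a simplified model and not the real tripleo_common/_stack_update.py: the Event tuple, the classify helper, the patched flag, the 'not_started' default and the two reason strings are assumptions for illustration; only the status tuples are taken from the diff.

```python
from collections import namedtuple

# Simplified stand-in for a Heat event (assumed shape, for illustration only).
Event = namedtuple("Event", ["resource_status", "resource_status_reason"])

# Assumed reason strings; the real values live in _stack_update.py.
HOOK_REASON = "UPDATE paused until Hook pre-update is cleared"
HOOK_CLEAR_REASON = "Hook pre-update is cleared"


def classify(event, patched):
    """Map the latest event of a resource to a client-side state."""
    if patched:
        # Status sets after the patch from the diff.
        in_progress = ("CREATE_IN_PROGRESS", "UPDATE_IN_PROGRESS")
        complete = ("CREATE_COMPLETE", "UPDATE_COMPLETE")
    else:
        # Status sets before the patch.
        in_progress = ("UPDATE_IN_PROGRESS",)
        complete = ("UPDATE_COMPLETE",)

    state = "not_started"  # assumed default when nothing matches
    if event.resource_status_reason == HOOK_REASON:
        state = "on_breakpoint"
    elif event.resource_status_reason == HOOK_CLEAR_REASON:
        state = "in_progress"
    elif event.resource_status in in_progress:
        state = "in_progress"
    elif event.resource_status in complete:
        state = "completed"
    return state


# A resource whose latest event carries a CREATE_* status.
ev = Event("CREATE_COMPLETE", "state changed")
print(classify(ev, patched=False))  # not_started
print(classify(ev, patched=True))   # completed
```

Under these assumptions, a resource whose latest event has a CREATE_* status is never counted as in progress or completed by the unpatched logic, which would be consistent with the client looping on IN_PROGRESS in the 3rd run.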
Confirmed that the fix is included in the "Fixed in Version" rpm and completed an update successfully locally. A build containing this fix was also confirmed to resolve the problem in environments that displayed the issue, cf. comment 6.

$ rpm -qa openstack-tripleo-common
openstack-tripleo-common-5.4.1-6.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2017:1242