Bug 1437016 - tripleo client stuck in IN_PROGRESS in overcloud update run
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 13.0 (Queens)
Assigned To: Julie Pichon
Keywords: OtherQA, ZStream
Depends On:
Blocks: 1434509 1457208 1520109
Reported: 2017-03-29 05:47 EDT by Martin Schuppert
Modified: 2018-10-20 00:02 EDT (History)
CC: 16 users

See Also:
Fixed In Version: openstack-tripleo-common-5.4.1-6.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1457208
Environment:
Last Closed: 2017-05-17 08:24:56 EDT
Type: Bug
Regression: ---


Attachments: (none)


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1677548 None None None 2017-03-30 07:04 EDT
Red Hat Knowledge Base (Solution) 3113281 None None None 2017-07-13 04:39 EDT
OpenStack gerrit 452626 None None None 2017-04-03 04:11 EDT
Red Hat Product Errata RHSA-2017:1242 normal SHIPPED_LIVE Important: Red Hat OpenStack Platform director security update 2017-05-17 12:19:05 EDT

Description Martin Schuppert 2017-03-29 05:47:21 EDT
Description of problem:

This BZ is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1432571 in 
order to precisely track an issue that we were able to reproduce internally as well. Namely, 
when running:
openstack overcloud update stack -i overcloud

When triggering an overcloud minor update, the tripleoclient gets stuck in IN_PROGRESS and
times out after the default 4h timeout, even though the update step went through on the overcloud nodes.

Reproduced with:

1) made the stack fail as in https://bugzilla.redhat.com/show_bug.cgi?id=1416228
Note: it is not yet known whether a previously failed stack update is required, but these were the steps
that led to reproducing the issue

[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: 6c981c34-8d4d-4761-9a16-08e3d789b527
Plan updated
Deploying templates in the directory /tmp/tripleoclient-ERSxeZ/tripleo-heat-templates
Started Mistral Workflow. Execution ID: a889c346-2e4a-4dfe-9409-36e91b1d8773
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed
[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear d0af588b-2ed2-46e8-89bd-466111526b8b), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.

=> now we have the overcloud stack in the UPDATE_FAILED state, which is what we wanted

2) fixed yum_update.sh again so it no longer fails on the compute node when run
[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: e0ac2916-d440-4419-8647-0578eeaf5084
Plan updated
Deploying templates in the directory /tmp/tripleoclient-kyRZCA/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 789cf917-e326-4dca-95d5-742ff136550c
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed

3) when we now trigger an overcloud update, we clear the first breakpoint, which is
the node where we previously failed:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1', u'controller-2', u'controller-0']
on_breakpoint: [u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 99cdfbce-30bd-4c73-8055-050a0af48e56), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...

From the logs of the compute node we see that the update on the compute went through
and also that the compute signaled success back to heat-cfn.

It was identified that there is an issue where hooks aren't retrieved properly in heatclient:
https://bugzilla.redhat.com/show_bug.cgi?id=1436712

~~~
We used hooks on the UpdateDeployment resources while making a minor update. The first 
and/or second hooks generally worked fine, but we were waiting forever on the last 2. 
The heat hook-poll -n5 overcloud command didn't return anything.

It turns out the client detection of hooks is broken. We don't set the stack_name of 
the event correctly, and as we use the stack_name to identify the event, we can't 
detect the hooks correctly.

This affects the heat command line client (openstack stack hook poll / heat hook-poll).
~~~
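
To make the quoted failure mode concrete, here is a minimal, purely illustrative Python sketch (not the actual heatclient code; the Event class, reason string, and function name are assumptions): when pending hooks are identified by comparing an event's stack_name against the expected stack, an event whose stack_name was never populated correctly can never match, so the hook is never reported even though the server side paused as expected.

class Event(object):
    def __init__(self, stack_name, resource_name, resource_status_reason):
        self.stack_name = stack_name
        self.resource_name = resource_name
        self.resource_status_reason = resource_status_reason

def pending_hooks(events, stack_name, hook_marker='paused until Hook'):
    """Return resource names waiting on a hook, matched by stack name."""
    hooks = []
    for ev in events:
        # If ev.stack_name was not set correctly (the bug quoted above),
        # this comparison fails and the pending hook is silently skipped.
        if ev.stack_name == stack_name and hook_marker in (ev.resource_status_reason or ''):
            hooks.append(ev.resource_name)
    return hooks

ok = Event('overcloud', 'UpdateDeployment',
           'UPDATE paused until Hook pre-update is cleared')
broken = Event(None, 'UpdateDeployment',
               'UPDATE paused until Hook pre-update is cleared')

print(pending_hooks([ok], 'overcloud'))      # ['UpdateDeployment']
print(pending_hooks([broken], 'overcloud'))  # [] -- hook never detected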

While the tripleoclient is stuck in IN_PROGRESS, we were able to move the update forward by
clearing the hook on the next OS::TripleO::Controller resource, e.g.
"openstack stack hook clear --pre-update 5045a96a-3399-4491-9961-d26e5fc93830 UpdateDeployment",
once the logs show that the update on that node went through (a scripted sketch of this workaround
follows the output below). After the update went through all nodes, the tripleoclient ended with update complete:
...
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE
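
For reference, a small Python sketch of scripting that manual workaround, assuming the stuck nested stack IDs have already been looked up (the placeholder IDs and the loop are assumptions; the CLI call is the same one used above):

import subprocess

# Placeholder nested stack IDs of the stuck OS::TripleO::Controller resources;
# look them up first (e.g. via "openstack stack resource list overcloud").
stuck_nested_stacks = ['<nested-stack-id-1>', '<nested-stack-id-2>']

for stack_id in stuck_nested_stacks:
    # Same command used manually above; run it only once the node's logs show
    # that its update actually finished.
    subprocess.check_call([
        'openstack', 'stack', 'hook', 'clear',
        '--pre-update', stack_id, 'UpdateDeployment',
    ])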

It would seem that tripleoclient fails to detect pending hooks as well. But the implementation 
is completely different from the heatclient one, so the bug source ought to be different as well.
It's also worth noting that the server side seems fine on that aspect.


Version-Release number of selected component (if applicable):
python-tripleoclient-5.4.1-1.el7ost.noarch
Comment 1 Red Hat Bugzilla Rules Engine 2017-03-29 05:47:32 EDT
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
Comment 5 Julie Pichon 2017-03-31 06:21:42 EDT
Martin, thank you for the environment information. There are multiple people connected to it and a stack update in progress, so I assume it is used as part of other bugs as well? Despite my best efforts I've been unable to reproduce the bug exactly so it's difficult for me to confirm if the patch upstream will fix this particular case. Is it possible to apply the patch in your lab or another test environment where the issue has been confirmed?
Comment 6 Martin Schuppert 2017-03-31 10:03:49 EDT
1st run:

* put stack in failed state as mentioned in description
[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: ec001f0d-49b7-4afb-b3cf-e4d9a3a5f287
Plan updated
Deploying templates in the directory /tmp/tripleoclient-vj1ILn/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 4b506ad6-f6ae-4396-816d-d8fbe9f9c0b0
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed
[stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+---------------+----------------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| 0add0f72-8693-424b-bf28-06b11402340d | overcloud  | UPDATE_FAILED | 2017-03-18T23:18:30Z | 2017-03-31T11:10:57Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+

* applied https://review.openstack.org/#/c/451725/3/tripleo_common/_stack_update.py

[stack@undercloud-0 ~]$ diff -u _stack_update.py _stack_update.py-fix
--- _stack_update.py    2017-03-31 11:50:01.356143531 +0000
+++ _stack_update.py-fix        2017-03-31 11:03:14.169127182 +0000
@@ -160,9 +160,9 @@
                     state = 'on_breakpoint'
                 elif ev.resource_status_reason == hook_clear_reason:
                     state = 'in_progress'
-                elif ev.resource_status == 'UPDATE_IN_PROGRESS':
+                elif ev.resource_status in ('CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS'):
                     state = 'in_progress'
-                elif ev.resource_status == 'UPDATE_COMPLETE':
+                elif ev.resource_status in ('CREATE_COMPLETE', 'UPDATE_COMPLETE'):
                     state = 'completed'
             resources[state][res.physical_resource_id] = res
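
To make the effect of those one-line changes clearer, a simplified standalone rendering of the classification the patch touches (the function name, placeholder reason strings, and fall-through value are assumptions, not a drop-in copy of _stack_update.py):

def classify(resource_status, status_reason,
             breakpoint_reason='<breakpoint reason>',
             hook_clear_reason='<hook clear reason>'):
    # Simplified illustration of the logic changed by the diff above.
    if status_reason == breakpoint_reason:
        return 'on_breakpoint'
    if status_reason == hook_clear_reason:
        return 'in_progress'
    # The fix: resources that are newly created during the update (rather than
    # updated) report CREATE_* statuses, which the old code never matched, so
    # they were never moved to in_progress/completed and the client kept
    # polling forever.
    if resource_status in ('CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS'):
        return 'in_progress'
    if resource_status in ('CREATE_COMPLETE', 'UPDATE_COMPLETE'):
        return 'completed'
    return 'unknown'

print(classify('CREATE_COMPLETE', ''))  # 'completed' with the fix; unmatched before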

* update was successful:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1']
on_breakpoint: [u'compute-0', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0']
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: controller-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 8963c6f9-ac10-4937-adc7-62114739a845), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'controller-2', u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 0e4c9349-9a54-4371-8fc6-f4f0b9428744), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE

* a second test run was also successful

* in a 3rd run I reverted the patch and the update got stuck again:
[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear c820338e-79c5-4f13-8a13-0646911d07a9), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...
Comment 14 Julie Pichon 2017-05-09 11:23:53 EDT
Confirmed that the fix is included in the "Fixed in Version" rpm and completed an update successfully locally. A build containing this fix was also confirmed to resolve the problem in environments that displayed the issue, cf. comment 6.

$ rpm -qa openstack-tripleo-common
openstack-tripleo-common-5.4.1-6.el7ost.noarch
Comment 16 errata-xmlrpc 2017-05-17 08:24:56 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1242
