Bug 1437016

Summary: tripleo client stuck in IN_PROGRESS in overcloud update run
Product: Red Hat OpenStack Reporter: Martin Schuppert <mschuppe>
Component: openstack-tripleo-common Assignee: Julie Pichon <jpichon>
Status: CLOSED ERRATA QA Contact: Julie Pichon <jpichon>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: afariasa, aschultz, cpaquin, dbecker, hbrock, hjensas, jjoyce, jpichon, jslagle, mburns, mburrows, mschuppe, pneedle, rhel-osp-director-maint, slinaber, therve, yprokule
Target Milestone: async Keywords: OtherQA, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-common-5.4.1-6.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1457208 (view as bug list) Environment:
Last Closed: 2019-02-14 15:13:21 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1434509, 1457208, 1520109    

Description Martin Schuppert 2017-03-29 09:47:21 UTC
Description of problem:

This BZ is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1432571 in 
order to precisely track an issue that we were able to reproduce internally as well. Namely, 
when running:
openstack overcloud update stack -i overcloud

When triggering an overcloud minor update, the tripleoclient is stuck in IN_PROGRESS and 
will time out after the 4h default timeout, even though the update step went through on the overcloud node.

Reproduced with:

1) made the stack fail as described in https://bugzilla.redhat.com/show_bug.cgi?id=1416228
Note: right now it is not known whether a failed stack update is needed, but these were the steps 
which led to successfully reproducing the issue

[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: 6c981c34-8d4d-4761-9a16-08e3d789b527
Plan updated
Deploying templates in the directory /tmp/tripleoclient-ERSxeZ/tripleo-heat-templates
Started Mistral Workflow. Execution ID: a889c346-2e4a-4dfe-9409-36e91b1d8773
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed
[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear d0af588b-2ed2-46e8-89bd-466111526b8b), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
FAILED
update finished with status FAILED
Stack update failed.

=> now we have the overcloud stack in update failed state, which is what we wanted

2) fixed yum_update.sh again so that it does not fail on the compute when run
[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: e0ac2916-d440-4419-8647-0578eeaf5084
Plan updated
Deploying templates in the directory /tmp/tripleoclient-kyRZCA/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 789cf917-e326-4dca-95d5-742ff136550c
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed

3) when we now trigger an overcloud update, we clear the 1st breakpoint, which is on
the node where we previously failed:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1', u'controller-2', u'controller-0']
on_breakpoint: [u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 99cdfbce-30bd-4c73-8055-050a0af48e56), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...

From the logs of the compute service we see that the update on the compute went through
and also that the compute signaled success back to heat-cfn

It was identified that there is an issue where hooks aren't retrieved properly in heat-client:
https://bugzilla.redhat.com/show_bug.cgi?id=1436712

~~~
We used hooks on the UpdateDeployment resources while making a minor update. The first 
and/or second hooks generally worked fine, but we were waiting forever on the last 2. 
The heat hook-poll -n5 overcloud command didn't return anything.

It turns out the client detection of hooks is broken. We don't set the stack_name of 
the event correctly, and as we use the stack_name to identify the event, we can't 
detect the hooks correctly.

This affects the heat command line client (openstack stack hook poll / heat hook-poll).
~~~
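
To illustrate the failure mode quoted above, here is a minimal sketch. It is not the heatclient code (which is not shown in this report): the event dictionaries, the reason strings and the stack names are all made up for the example. It only shows how identifying events by stack_name can hide a pending hook when every event carries the same name.

```python
# Placeholder reason strings, not the text Heat actually emits.
HOOK_SET = "hook set"
HOOK_CLEARED = "hook cleared"


def pending_hooks(events):
    """Return the (stack_name, resource_name) keys whose hook is set but not cleared."""
    pending = set()
    for ev in events:  # events are assumed to be ordered oldest first
        key = (ev["stack_name"], ev["resource_name"])
        if ev["reason"] == HOOK_SET:
            pending.add(key)
        elif ev["reason"] == HOOK_CLEARED:
            pending.discard(key)
    return pending


def make_events(name_for):
    """Two nodes hit a hook on UpdateDeployment; only node-0's hook is cleared."""
    return [
        {"stack_name": name_for("node-0"), "resource_name": "UpdateDeployment",
         "reason": HOOK_SET},
        {"stack_name": name_for("node-1"), "resource_name": "UpdateDeployment",
         "reason": HOOK_SET},
        {"stack_name": name_for("node-0"), "resource_name": "UpdateDeployment",
         "reason": HOOK_CLEARED},
    ]


# Distinct (hypothetical) per-node stack names: node-1's hook is still pending.
correct = pending_hooks(make_events(lambda node: "overcloud-" + node))

# Every event mislabelled with the same name: clearing node-0 hides node-1.
mislabelled = pending_hooks(make_events(lambda node: "overcloud"))
```

With distinct names the second node's hook is still reported. With the shared name the result is empty, which would be consistent with hook-poll returning nothing while hooks are still waiting.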

While the tripleoclient is stuck in IN_PROGRESS, we were able to move the update forward by 
clearing the hook for the next OS::TripleO::Controller resource_type with a command like 
"openstack stack hook clear --pre-update 5045a96a-3399-4491-9961-d26e5fc93830 UpdateDeployment" 
once we saw from the logs that the update went through. After the update went through all nodes, 
the tripleoclient ended with update complete:
...
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE

It would seem that tripleoclient fails to detect pending hooks as well. But the implementation 
is completely different from the heatclient one, so the bug source ought to be different as well.
It's also worth noting that the server side seems fine on that aspect.


Version-Release number of selected component (if applicable):
python-tripleoclient-5.4.1-1.el7ost.noarch

Comment 1 Red Hat Bugzilla Rules Engine 2017-03-29 09:47:32 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 5 Julie Pichon 2017-03-31 10:21:42 UTC
Martin, thank you for the environment information. There are multiple people connected to it and a stack update in progress, so I assume it is used as part of other bugs as well? Despite my best efforts I've been unable to reproduce the bug exactly so it's difficult for me to confirm if the patch upstream will fix this particular case. Is it possible to apply the patch in your lab or another test environment where the issue has been confirmed?

Comment 6 Martin Schuppert 2017-03-31 14:03:49 UTC
1st run:

* put stack in failed state as mentioned in description
[stack@undercloud-0 ~]$ ./overcloud_update_plan_only.sh
Removing the current plan files
Uploading new plan files
Started Mistral Workflow. Execution ID: ec001f0d-49b7-4afb-b3cf-e4d9a3a5f287
Plan updated
Deploying templates in the directory /tmp/tripleoclient-vj1ILn/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 4b506ad6-f6ae-4396-816d-d8fbe9f9c0b0
Overcloud Endpoint: http://10.0.0.103:5000/v2.0
Overcloud Deployed
[stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+---------------+----------------------+----------------------+
| id                                   | stack_name | stack_status  | creation_time        | updated_time         |
+--------------------------------------+------------+---------------+----------------------+----------------------+
| 0add0f72-8693-424b-bf28-06b11402340d | overcloud  | UPDATE_FAILED | 2017-03-18T23:18:30Z | 2017-03-31T11:10:57Z |
+--------------------------------------+------------+---------------+----------------------+----------------------+

* applied https://review.openstack.org/#/c/451725/3/tripleo_common/_stack_update.py

[stack@undercloud-0 ~]$ diff -u _stack_update.py _stack_update.py-fix
--- _stack_update.py    2017-03-31 11:50:01.356143531 +0000
+++ _stack_update.py-fix        2017-03-31 11:03:14.169127182 +0000
@@ -160,9 +160,9 @@
                     state = 'on_breakpoint'
                 elif ev.resource_status_reason == hook_clear_reason:
                     state = 'in_progress'
-                elif ev.resource_status == 'UPDATE_IN_PROGRESS':
+                elif ev.resource_status in ('CREATE_IN_PROGRESS', 'UPDATE_IN_PROGRESS'):
                     state = 'in_progress'
-                elif ev.resource_status == 'UPDATE_COMPLETE':
+                elif ev.resource_status in ('CREATE_COMPLETE', 'UPDATE_COMPLETE'):
                     state = 'completed'
             resources[state][res.physical_resource_id] = res
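
To make the effect of the diff easier to see, here is a standalone sketch of the classification it touches. This is a reconstruction, not the real _stack_update.py: the Event type, the reason strings, the 'not_started' default, the first branch and the loop over events are assumptions, since only the lines in the diff are visible in this report.

```python
from collections import namedtuple

# Hypothetical stand-in for a Heat event; only the attributes used in the diff.
Event = namedtuple("Event", ["resource_status", "resource_status_reason"])

# Placeholder reason strings; the real values are not shown in this report.
HOOK_REASON = "hook set (placeholder)"
HOOK_CLEAR_REASON = "hook cleared (placeholder)"


def classify(events, patched=True):
    """Return the state implied by a resource's events, oldest first."""
    if patched:
        in_progress = ("CREATE_IN_PROGRESS", "UPDATE_IN_PROGRESS")
        complete = ("CREATE_COMPLETE", "UPDATE_COMPLETE")
    else:
        in_progress = ("UPDATE_IN_PROGRESS",)
        complete = ("UPDATE_COMPLETE",)
    state = "not_started"  # assumed default, matching the client output
    for ev in events:
        if ev.resource_status_reason == HOOK_REASON:  # assumed first branch
            state = "on_breakpoint"
        elif ev.resource_status_reason == HOOK_CLEAR_REASON:
            state = "in_progress"
        elif ev.resource_status in in_progress:
            state = "in_progress"
        elif ev.resource_status in complete:
            state = "completed"
    return state


# A resource that was created rather than updated during this run
# (for example after a previously failed update; this is a guess).
recreated = [
    Event("CREATE_IN_PROGRESS", "state changed (placeholder)"),
    Event("CREATE_COMPLETE", "state changed (placeholder)"),
]

before_fix = classify(recreated, patched=False)
after_fix = classify(recreated, patched=True)
```

Under these assumptions, a resource whose events are CREATE_* stays in not_started without the patch and is reported as completed with it, which would explain why the client never moved on.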

* update was successful:

[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-1']
on_breakpoint: [u'compute-0', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0']
on_breakpoint: [u'controller-1', u'controller-2', u'controller-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 1d95309a-cc90-4e1a-b3ae-5168c5aef841), no=cancel update, C-c=quit interactive mode: controller-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 8963c6f9-ac10-4937-adc7-62114739a845), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
WAITING
completed: [u'controller-2', u'compute-0', u'controller-0']
on_breakpoint: [u'controller-1']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear 0e4c9349-9a54-4371-8fc6-f4f0b9428744), no=cancel update, C-c=quit interactive mode:
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
COMPLETE
update finished with status COMPLETE

* a second test run was also successful

* in a 3rd run I reverted the patch and the update is stuck again:
[stack@undercloud-0 ~]$ openstack overcloud update stack -i overcloud
starting package update on stack overcloud
WAITING
not_started: [u'controller-0']
on_breakpoint: [u'controller-1', u'controller-2', u'compute-0']
Breakpoint reached, continue? Regexp or Enter=proceed (will clear c820338e-79c5-4f13-8a13-0646911d07a9), no=cancel update, C-c=quit interactive mode: compute-0
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
IN_PROGRESS
...

Comment 14 Julie Pichon 2017-05-09 15:23:53 UTC
Confirmed that the fix is included in the "Fixed in Version" rpm and completed an update successfully locally. A build containing this fix was also confirmed to resolve the problem in environments that displayed the issue, cf. comment 6.

$ rpm -qa openstack-tripleo-common
openstack-tripleo-common-5.4.1-6.el7ost.noarch

Comment 16 errata-xmlrpc 2017-05-17 12:24:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1242