Description of problem:
This BZ is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1432571 in order to precisely track an issue that we were able to reproduce internally as well.

Namely, when running:

openstack overcloud update stack -i overcloud

and giving compute-0 as the first node when asked, we observe the following:

1. The yum_update.sh script runs to completion on compute-0
2. It tries to signal the completion back to heat on the undercloud
3. The heat-api-cfn service seems to return a 403
4. So heat-engine never becomes aware of the completion and the command is stuck IN_PROGRESS until the general timeout kicks in

Version-Release number of selected component (if applicable):
openstack-heat-api-cfn-7.0.2-4.el7ost.noarch
openstack-heat-engine-7.0.2-4.el7ost.noarch
openstack-heat-common-7.0.2-4.el7ost.noarch
openstack-heat-api-7.0.2-4.el7ost.noarch

Note: we did apply https://review.openstack.org/#/c/441972/ by hand and restarted the heat-* services.

How reproducible:
Not always, unfortunately.

Actual results:
The client is stuck in IN_PROGRESS.

Expected results:
The update moves on to the next node.

Additional info:
From the sosreport on the compute node we see this (sos_commands/logs/journalctl_--no-pager_--boot):

Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:27,997] (heat-config) [INFO] deploy_signal_verb=POST
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:27,997] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/df00f14a-bd21-49bc-9c2f-6c109381808c
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [INFO] Started yum_update.sh on server 6ce9048b-721a-46bb-9ebb-6a4006e039e2 at Mon Mar 20 17:45:28 UTC 2017
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [DEBUG]
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/df00f14a-bd21-49bc-9c2f-6c109381808c. [3]
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,912] (heat-config) [INFO] Completed /usr/libexec/heat-config/hooks/script
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,912] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/df00f14a-bd21-49bc-9c2f-6c109381808c.json < /var/lib/heat-config/deployed/df00f14a-bd21-49bc-9c2f-6c109381808c.notify.json
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,548] (heat-config) [INFO]
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,549] (heat-config) [DEBUG] [2017-03-20 17:45:29,485] (heat-config-notify) [DEBUG] Signaling to http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3Ac52ae115ab604e99af5199def0a01b46%3Astacks%2Fovercloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af&SignatureVersion=2&Signature=DqmtRbiVFz9vg2gnvsGiYkFmaPUFEDGfJQnHLEoX4w4%3D via POST
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,521] (heat-config-notify) [DEBUG] Response <Response [403]>

Note that whether or not yum_update.sh gave an error is not relevant here.
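For reference, the step that fails is the heat-config-notify call in the last two journal lines: it reads the deployment metadata JSON and POSTs the hook outputs back to the pre-signed heat-api-cfn URL. A minimal sketch of that step (illustrative only, not the actual heat-config-notify code; the deploy_signal_id/deploy_signal_verb input names are taken from the log above, everything else is assumed):

#!/usr/bin/env python3
# Minimal sketch of the signalling step, NOT the real heat-config-notify.
# Usage mirrors the journal line:
#   notify_sketch.py /var/lib/heat-config/deployed/<uuid>.json < <uuid>.notify.json
import json
import sys

import requests

with open(sys.argv[1]) as f:
    deployment = json.load(f)

# heat-config deployments carry their inputs as a list of {name, value} dicts
inputs = {i['name']: i.get('value') for i in deployment.get('inputs', [])}
signal_url = inputs['deploy_signal_id']        # the pre-signed .../v1/signal/arn... URL
signal_verb = inputs.get('deploy_signal_verb', 'POST')

body = json.load(sys.stdin)                    # stdout/stderr/status collected from the hook

resp = requests.request(signal_verb, signal_url, json=body)
print('Response %s' % resp)                    # in the failing case this printed <Response [403]>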
The problem is that heat on the undercloud never gets the completion status. On the undercloud we see that heat-api-cfn is returning a 403:

2017-03-20 17:45:35.500 1294 INFO eventlet.wsgi.server [-] 192.168.24.12 - - [20/Mar/2017 17:45:35] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3Ac52ae115ab604e99af5199def0a01b46%3Astacks%2Fovercloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af&SignatureVersion=2&Signature=DqmtRbiVFz9vg2gnvsGiYkFmaPUFEDGfJQnHLEoX4w4%3D HTTP/1.1" 403 306 0.021620

(Note: the clocks on the undercloud and compute-0 are skewed by 5 seconds, hence the slight difference in the timestamps.)

So the theory here is that heat-api-cfn returns a 403 to the completion signal from the compute-0 node, heat-engine is therefore never made aware that the deployment actually completed, and we are stuck until the timeout.
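To confirm that both log entries refer to the same pre-signed request (same resource, same AWSAccessKeyId, same Timestamp), the signal URL can be decoded with the standard library alone. A small, purely illustrative snippet (Python 3; the signature value is elided here):

# Decode the pre-signed signal URL copied from the journal / heat-api-cfn log
from urllib.parse import parse_qs, unquote, urlsplit

url = ("http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A"
       "c52ae115ab604e99af5199def0a01b46%3Astacks%2F"
       "overcloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F"
       "054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment"
       "?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256"
       "&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af"
       "&SignatureVersion=2&Signature=...")

parts = urlsplit(url)
print(unquote(parts.path))   # arn:openstack:heat::<project>:stacks/.../resources/UpdateDeployment
print(parse_qs(parts.query)) # Timestamp, SignatureMethod, AWSAccessKeyId, SignatureVersion

Both the compute-0 and the undercloud entries decode to the same UpdateDeployment resource and the same access key.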
EC2 signatures should not expire, so I'd have to assume that it's due to the user associated with the deployment being deleted. That could happen when the resource is being replaced, but in that case we shouldn't be continuing to use the old user's credentials.
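Some background on why the signature itself should not be the problem: the signal URL is pre-signed with EC2 SignatureVersion 2, i.e. an HMAC-SHA256 over the verb, host, path and sorted query parameters, and no expiry is part of that scheme. A rough sketch of the calculation (the secret key is made up and the path is shortened; this only illustrates what gets validated server-side):

# Illustrative EC2 SignatureVersion 2 calculation (AWS-style presigned query).
# The Timestamp is just another signed parameter; nothing here expires.
import base64
import hashlib
import hmac
from urllib.parse import quote

params = {
    'Timestamp': '2017-03-17T22:34:12Z',
    'SignatureMethod': 'HmacSHA256',
    'AWSAccessKeyId': 'b5f66232e0014045ab2a0b46307532af',
    'SignatureVersion': '2',
}
canonical_qs = '&'.join('%s=%s' % (k, quote(v, safe='-_.~'))
                        for k, v in sorted(params.items()))
string_to_sign = '\n'.join([
    'POST',
    '192.168.24.1:8000',
    '/v1/signal/arn%3Aopenstack%3Aheat%3A%3A...%2FUpdateDeployment',  # shortened
    canonical_qs,
])
secret = b'the-ec2-secret-key-held-by-the-stack-user'   # hypothetical value
signature = base64.b64encode(
    hmac.new(secret, string_to_sign.encode('utf-8'), hashlib.sha256).digest())
print(signature.decode())

If the stack user (and with it the ec2 keypair behind AWSAccessKeyId) has been deleted, validating such a request can only fail, which would show up as exactly this kind of 403.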
We haven't found any indication that the 403 was an issue. We did find an issue with hooks, though. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1436712 to make heat hook-poll work again. We need another one for tripleoclient/tripleo-common.
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1437016 for tripleoclient
Here is the situation as far as I understand it wrt 403s on the os-collect-config side.

403s are expected when 55-heat-config or 99-refresh-completed attempt to signal a resource which no longer exists (which could happen for a few reasons: a database restore, or multiple stack updates spanning a network outage).

For deployment resources which call hooks (group: script|puppet), signalling happens via 55-heat-config -> heat-config-notify. If the response is a 403, 55-heat-config will continue to process other deployments.

For os-apply-config resources (group: os-apply-config), signalling happens via 99-refresh-completed, and due to bug 1285495 any 403 will result in later os-apply-config resources not being signalled. This is only an issue in OSP-10, since from OSP-11 onwards os-apply-config is handled by a hook, so signalling happens via 55-heat-config.

bug 1285495 should still be fixed for OSP-10 if 403s are causing *other* os-apply-config resources to never be signalled - it is not clear whether that is the case here. Unless there is another bug in heat where users are being deleted when they shouldn't be, fixing bug 1285495 in OSP-10 will make all 403s harmless.

If stuck deployment resources block minor updates like this often enough, it might be useful to build a client tool which interactively shows the user which deployment is still waiting for a signal and gives them the option of sending a fake signal indicating COMPLETE or FAILED for that resource. A rough sketch of what such a tool could look like is below.
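Sketch of the fake-signal idea, using python-heatclient (the environment variables assume a sourced stackrc; the filter on resource types and the shape of the signal payload - deploy_status_code and friends - are assumptions about how software deployments are normally signalled, not a verified procedure):

#!/usr/bin/env python3
# Sketch of an interactive "fake signal" helper, not a supported tool.
import os
import sys

from heatclient.client import Client
from keystoneauth1 import session
from keystoneauth1.identity import generic

auth = generic.Password(
    auth_url=os.environ['OS_AUTH_URL'],
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    project_name=os.environ.get('OS_PROJECT_NAME',
                                os.environ.get('OS_TENANT_NAME')))
heat = Client('1', session=session.Session(auth=auth))

# e.g. the nested Compute stack that holds the stuck UpdateDeployment
stack = sys.argv[1]

for res in heat.resources.list(stack):
    if (res.resource_status.endswith('IN_PROGRESS')
            and 'Deployment' in res.resource_type):
        answer = input('Fake-signal %s (%s)? [y/N] '
                       % (res.resource_name, res.resource_status))
        if answer.lower() == 'y':
            # deploy_status_code 0 should let heat mark the deployment COMPLETE;
            # a non-zero value should mark it FAILED
            heat.resources.signal(stack, res.resource_name,
                                  data={'deploy_stdout': '',
                                        'deploy_stderr': 'signalled by hand',
                                        'deploy_status_code': 0})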
Resolved by the fix for bug 1436712.