Description of problem:
This BZ is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1432571 in order to precisely track an issue that we were able to reproduce internally as well.

Namely, when running:

openstack overcloud update stack -i overcloud

and giving compute-0 as the first node when asked, we observe the following:

1. The yum_update.sh script runs to completion on compute-0
2. It tries to signal the completion back to heat on the undercloud
3. The heat-api-cfn service seems to return a 403
4. So heat-engine never becomes aware of the completion and the command is stuck IN_PROGRESS until the general timeout kicks in

Version-Release number of selected component (if applicable):
openstack-heat-api-cfn-7.0.2-4.el7ost.noarch
openstack-heat-engine-7.0.2-4.el7ost.noarch
openstack-heat-common-7.0.2-4.el7ost.noarch
openstack-heat-api-7.0.2-4.el7ost.noarch

Note: we did apply https://review.openstack.org/#/c/441972/ by hand and restarted the heat-* services.

How reproducible:
Not always, unfortunately.

Actual results:
The client is stuck in IN_PROGRESS.

Expected results:
The update moves on to the next node.

Additional info:
From the sosreport on the compute node we see this (sos_commands/logs/journalctl_--no-pager_--boot):

Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:27,997] (heat-config) [INFO] deploy_signal_verb=POST
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:27,997] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/df00f14a-bd21-49bc-9c2f-6c109381808c
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [INFO] Started yum_update.sh on server 6ce9048b-721a-46bb-9ebb-6a4006e039e2 at Mon Mar 20 17:45:28 UTC 2017
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [DEBUG]
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,907] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/df00f14a-bd21-49bc-9c2f-6c109381808c. [3]
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,912] (heat-config) [INFO] Completed /usr/libexec/heat-config/hooks/script
Mar 20 17:45:28 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:28,912] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/df00f14a-bd21-49bc-9c2f-6c109381808c.json < /var/lib/heat-config/deployed/df00f14a-bd21-49bc-9c2f-6c109381808c.notify.json
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,548] (heat-config) [INFO]
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,549] (heat-config) [DEBUG] [2017-03-20 17:45:29,485] (heat-config-notify) [DEBUG] Signaling to http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3Ac52ae115ab604e99af5199def0a01b46%3Astacks%2Fovercloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af&SignatureVersion=2&Signature=DqmtRbiVFz9vg2gnvsGiYkFmaPUFEDGfJQnHLEoX4w4%3D via POST
Mar 20 17:45:29 compute-0.localdomain os-collect-config[247960]: [2017-03-20 17:45:29,521] (heat-config-notify) [DEBUG] Response <Response [403]>

Note that whether or not yum_update.sh gave an error is not relevant here.
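For reference, the step that fails is the heat-config-notify call in the last two journal lines: it reads the deployment metadata JSON and POSTs the hook outputs back to the pre-signed heat-api-cfn URL. A minimal sketch of that step (illustrative only, not the actual heat-config-notify code; the deploy_signal_id/deploy_signal_verb input names are taken from the log above, everything else is assumed):

#!/usr/bin/env python3
# Minimal sketch of the signalling step, NOT the real heat-config-notify.
# Usage mirrors the journal line:
#   notify_sketch.py /var/lib/heat-config/deployed/<uuid>.json < <uuid>.notify.json
import json
import sys

import requests

with open(sys.argv[1]) as f:
    deployment = json.load(f)

# heat-config deployments carry their inputs as a list of {name, value} dicts
inputs = {i['name']: i.get('value') for i in deployment.get('inputs', [])}
signal_url = inputs['deploy_signal_id']        # the pre-signed .../v1/signal/arn... URL
signal_verb = inputs.get('deploy_signal_verb', 'POST')

body = json.load(sys.stdin)                    # stdout/stderr/status collected from the hook

resp = requests.request(signal_verb, signal_url, json=body)
print('Response %s' % resp)                    # in the failing case this printed <Response [403]>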
The problem is that heat on the undercloud never gets the completion status. On the undercloud we see that heat-api-cfn is returning a 403:

2017-03-20 17:45:35.500 1294 INFO eventlet.wsgi.server [-] 192.168.24.12 - - [20/Mar/2017 17:45:35] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3Ac52ae115ab604e99af5199def0a01b46%3Astacks%2Fovercloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af&SignatureVersion=2&Signature=DqmtRbiVFz9vg2gnvsGiYkFmaPUFEDGfJQnHLEoX4w4%3D HTTP/1.1" 403 306 0.021620

(Note: the clocks on the undercloud and compute-0 are skewed by 5 seconds, hence the slight difference in the timestamps.)

So the theory here is that heat-api-cfn returns a 403 to the completion signal from the compute-0 node, heat-engine is therefore never made aware that the deployment actually completed, and we are stuck until the timeout.
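To confirm that both log entries refer to the same pre-signed request (same resource, same AWSAccessKeyId, same Timestamp), the signal URL can be decoded with the standard library alone. A small, purely illustrative snippet (Python 3; the signature value is elided here):

# Decode the pre-signed signal URL copied from the journal / heat-api-cfn log
from urllib.parse import parse_qs, unquote, urlsplit

url = ("http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A"
       "c52ae115ab604e99af5199def0a01b46%3Astacks%2F"
       "overcloud-Compute-73ctxwtus77f-0-3hpvac54kk4c%2F"
       "054ca519-7705-4438-93e3-0192a89c8ac0%2Fresources%2FUpdateDeployment"
       "?Timestamp=2017-03-17T22%3A34%3A12Z&SignatureMethod=HmacSHA256"
       "&AWSAccessKeyId=b5f66232e0014045ab2a0b46307532af"
       "&SignatureVersion=2&Signature=...")

parts = urlsplit(url)
print(unquote(parts.path))   # arn:openstack:heat::<project>:stacks/.../resources/UpdateDeployment
print(parse_qs(parts.query)) # Timestamp, SignatureMethod, AWSAccessKeyId, SignatureVersion

Both the compute-0 and the undercloud entries decode to the same UpdateDeployment resource and the same access key.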
EC2 signatures should not expire, so I'd have to assume that it's due to the user associated with the deployment being deleted. That could happen when the resource is being replaced, but in that case we shouldn't be continuing to use the old user's credentials.
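Some background on why the signature itself should not be the problem: the signal URL is pre-signed with EC2 SignatureVersion 2, i.e. an HMAC-SHA256 over the verb, host, path and sorted query parameters, and no expiry is part of that scheme. A rough sketch of the calculation (the secret key is made up and the path is shortened; this only illustrates what gets validated server-side):

# Illustrative EC2 SignatureVersion 2 calculation (AWS-style presigned query).
# The Timestamp is just another signed parameter; nothing here expires.
import base64
import hashlib
import hmac
from urllib.parse import quote

params = {
    'Timestamp': '2017-03-17T22:34:12Z',
    'SignatureMethod': 'HmacSHA256',
    'AWSAccessKeyId': 'b5f66232e0014045ab2a0b46307532af',
    'SignatureVersion': '2',
}
canonical_qs = '&'.join('%s=%s' % (k, quote(v, safe='-_.~'))
                        for k, v in sorted(params.items()))
string_to_sign = '\n'.join([
    'POST',
    '192.168.24.1:8000',
    '/v1/signal/arn%3Aopenstack%3Aheat%3A%3A...%2FUpdateDeployment',  # shortened
    canonical_qs,
])
secret = b'the-ec2-secret-key-held-by-the-stack-user'   # hypothetical value
signature = base64.b64encode(
    hmac.new(secret, string_to_sign.encode('utf-8'), hashlib.sha256).digest())
print(signature.decode())

If the stack user (and with it the ec2 keypair behind AWSAccessKeyId) has been deleted, validating such a request can only fail, which would show up as exactly this kind of 403.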
We haven't found any indication that the 403 was an issue. We did find an issue with hooks, though. I opened https://bugzilla.redhat.com/show_bug.cgi?id=1436712 to make heat hook-poll work again. We need another one for tripleoclient/tripleo-common.
I opened https://bugzilla.redhat.com/show_bug.cgi?id=1437016 for tripleoclient
Here is the situation as far as I understand it wrt 403s on the os-collect-config side.

403s are expected when 55-heat-config or 99-refresh-completed attempt to signal a resource which no longer exists (which could happen for a few reasons: a database restore, or multiple stack updates spanning a network outage).

For deployment resources which call hooks (group: script|puppet), signalling happens via 55-heat-config -> heat-config-notify. If the response is a 403, 55-heat-config will continue to process other deployments.

For os-apply-config resources (group: os-apply-config), signalling happens via 99-refresh-completed, and due to bug 1285495 any 403 will result in later os-apply-config resources not being signalled. This is only an issue in OSP-10, since from OSP-11 onwards os-apply-config is handled by a hook, so signalling happens via 55-heat-config.

bug 1285495 should still be fixed for OSP-10 if 403s are causing *other* os-apply-config resources to never be signalled - it is not clear whether that is the case here. Unless there is another bug in heat where users are being deleted when they shouldn't be, fixing bug 1285495 in OSP-10 will make all 403s harmless.

If stuck deployment resources block minor updates like this often enough, it might be useful to build a client tool which interactively shows the user which deployment is still waiting for a signal and gives them the option of sending a fake signal indicating COMPLETE or FAILED for that resource. A rough sketch of what such a tool could look like is below.
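Sketch of the fake-signal idea, using python-heatclient (the environment variables assume a sourced stackrc; the filter on resource types and the shape of the signal payload - deploy_status_code and friends - are assumptions about how software deployments are normally signalled, not a verified procedure):

#!/usr/bin/env python3
# Sketch of an interactive "fake signal" helper, not a supported tool.
import os
import sys

from heatclient.client import Client
from keystoneauth1 import session
from keystoneauth1.identity import generic

auth = generic.Password(
    auth_url=os.environ['OS_AUTH_URL'],
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    project_name=os.environ.get('OS_PROJECT_NAME',
                                os.environ.get('OS_TENANT_NAME')))
heat = Client('1', session=session.Session(auth=auth))

# e.g. the nested Compute stack that holds the stuck UpdateDeployment
stack = sys.argv[1]

for res in heat.resources.list(stack):
    if (res.resource_status.endswith('IN_PROGRESS')
            and 'Deployment' in res.resource_type):
        answer = input('Fake-signal %s (%s)? [y/N] '
                       % (res.resource_name, res.resource_status))
        if answer.lower() == 'y':
            # deploy_status_code 0 should let heat mark the deployment COMPLETE;
            # a non-zero value should mark it FAILED
            heat.resources.signal(stack, res.resource_name,
                                  data={'deploy_stdout': '',
                                        'deploy_stderr': 'signalled by hand',
                                        'deploy_status_code': 0})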
Resolved by the fix for bug 1436712.