Bug 1285495 - during update heat-config-notify gets a 403 response
during update heat-config-notify gets a 403 response
Status: NEW
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-image-elements (Show other bugs)
7.0 (Kilo)
Unspecified Unspecified
low Severity medium
: ---
: 10.0 (Newton)
Assigned To: Steve Baker
Shai Revivo
: Reopened, Triaged, ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-11-25 13:22 EST by marios
Modified: 2017-09-21 10:47 EDT (History)
9 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-10 00:10:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
capped undercloud heat and keystone logs (2.34 MB, application/x-xz)
2015-11-25 13:22 EST, marios
no flags Details
control0 sosreport (11.87 MB, application/x-xz)
2015-11-25 13:25 EST, marios
no flags Details

  None (edit)
Description marios 2015-11-25 13:22:14 EST
Created attachment 1098934 [details]
capped undercloud heat and keystone logs

Description of problem:

I see an issue in heat-config during an update, specifically seems like 
heat-config-notify gets a 403 response (forbidden?) - could this be a 
credentials issue somehow. Log looks like: (more attached - the sosreport from 
controller0 has the full log of the following)

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-script/49906ac4-5f22-427d-9ba3-3c0e958718e2
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,356] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,357] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/49906ac4-5f22-427d-9ba3-3c0e958718e2.json < /var/lib/heat-config/deployed/49906ac4-5f22-427d-9ba3-3c0e958718e2.notify.json
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [INFO]
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [DEBUG] [2015-11-25 10:15:30,887] (heat-config-notify) [DEBUG] Signaling to http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A4102f61fe21344e99daa97add9a368e4%3Astacks%2Fovercloud-Controller-iko5lkuvmmnc-0-halix4udz2xo%2F30065722-128e-49cb-9b68-2e2de7a98f58%2Fresources%2FUpdateDeployment?Timestamp=2015-11-18T07%3A13%3A55Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=d88efa476783480895e59014f475b2d8&SignatureVersion=2&Signature=Z6OEDtL9tfT155TnIIelHuRukE67OJ7rFHJmrWf%2B9uo%3D via POST
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,915] (heat-config-notify) [DEBUG] Response <Response [403]>
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,937] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/dabdc653-886f-427d-b21e-f2d915065b8b.json
Nov 25 10:15:31 overcloud-controller-0 crmd[2354]: notice: Operation mongod_start_0: ok (node=overcloud-controller-0, call=190, rc=0, cib-update=57, confirmed=true)
Nov 25 10:15:31 overcloud-controller-0 redis(redis)[4584]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'


The context here is that this happens during an update. Starting with a running
 7.1 overcloud, deployed like

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1  --libvirt-type qemu --ntp-server "0.fedora.pool.ntp.org" -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

yum update the undercloud then start the overcloud update like:

openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e update.yaml

Control1 is updated successfully, then compute0 is also updated. Control0 starts
 to get updated... but then control0 seems to hang - Bug 1285485 may be the root cause of that but I don't think that 403 should be there either? More context about the exact setup here is in that other bug if interested. Here I attach logs for undercloud keystone and heat.
 
I had to cap the length of the undercloud log files as they were massive.. i have more full/older logs if these are required
 
thanks, marios
 







Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Comment 1 marios 2015-11-25 13:25 EST
Created attachment 1098935 [details]
control0 sosreport
Comment 2 James Slagle 2015-11-25 13:42:59 EST
i wonder if maybe this could also be related to bug 1278181.

since /var/run/heat-config was gone and we're rerunning all the already applied deployments, perhaps heat-config-notify is trying to signal to a url that is no longer valid
Comment 3 James Slagle 2015-11-25 13:45:40 EST
steve, could you have a look at this one? i think marios has attached all the relevant logs.
Comment 4 marios 2015-11-27 01:43:37 EST
It has been reported by others that they updated from 7.1-->7.2 without issue, FYI... so this may be an environment issue...
Comment 6 Steve Baker 2015-11-30 22:08:37 EST
I don't think this is related to bug 1278181, or to rebooting the node. This is likely an environment issue.

I found where in heat-api-cfn.log. The tl;dr is that keystone returned the 403. I'll keep looking at the log to see if that was a transient error.

2015-11-25 10:15:22.096 29174 INFO eventlet.wsgi.server [req-f482fe78-2a90-4209-b870-
1f5d1ced704f 8bab563883474bdbbe269624dd0ab538 a27fb86e7cae453aa8e18e8d1fd40ccc] 192.0
.2.15 - - [25/Nov/2015 10:15:22] "GET /v1/?SignatureVersion=2&AWSAccessKeyId=926d1433
b89443b18d3f9b183f069ffc&StackName=overcloud-Controller-iko5lkuvmmnc-0-halix4udz2xo&S
ignatureMethod=HmacSHA256&Signature=BuyUHmNBAULUPzeRhcSdxwC3U69ESfznaUpqiR2hOTM%3D&Ac
tion=DescribeStackResource&LogicalResourceId=Controller HTTP/1.1" 200 522800 1.577003
2015-11-25 10:15:30.916 29174 INFO eventlet.wsgi.server [-] (29174) accepted ('192.0.2.15', 49828)
2015-11-25 10:15:30.917 29174 DEBUG heat.api.middleware.version_negotiation [-] Processing request: POST /v1/signal/arn:openstack:heat::4102f61fe21344e99daa97add9a368e4:stacks/overcloud-Controller-iko5lkuvmmnc-0-halix4udz2xo/30065722-128e-49cb-9b68-2e2de7a98f58/resources/UpdateDeployment Accept: */* process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2015-11-25 10:15:30.917 29174 DEBUG heat.api.middleware.version_negotiation [-] Matched versioned URI. Version: 1.0 process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:65
2015-11-25 10:15:30.917 29174 INFO heat.api.aws.ec2token [-] Checking AWS credentials..
2015-11-25 10:15:30.918 29174 INFO heat.api.aws.ec2token [-] AWS credentials found, checking against keystone.
2015-11-25 10:15:30.932 29174 INFO heat.api.aws.ec2token [-] Authenticating with http://192.0.2.1:5000/v3/ec2tokens
2015-11-25 10:15:30.937 29174 INFO heat.api.aws.ec2token [-] AWS authentication failure.
2015-11-25 10:15:30.938 29174 INFO eventlet.wsgi.server [-] 192.0.2.15 - - [25/Nov/2015 10:15:30] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3A4102f61fe21344e99daa97add9a368e4%3Astacks%2Fovercloud-Controller-iko5lkuvmmnc-0-halix4udz2xo%2F30065722-128e-49cb-9b68-2e2de7a98f58%2Fresources%2FUpdateDeployment?Timestamp=2015-11-18T07%3A13%3A55Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=d88efa476783480895e59014f475b2d8&SignatureVersion=2&Signature=Z6OEDtL9tfT155TnIIelHuRukE67OJ7rFHJmrWf%2B9uo%3D HTTP/1.1" 403 306 0.020683
Comment 7 Steve Baker 2015-11-30 22:17:44 EST
And the keystone log

2015-11-25 10:15:30.934 8740 DEBUG keystone.middleware.core [-] Auth token not in the request header. Will not build auth context. process_request /usr/lib/python2.7/site-packages/keystone/middleware/core.py:223
2015-11-25 10:15:30.934 8740 INFO keystone.common.wsgi [-] POST http://192.0.2.1:5000/v3/ec2tokens
2015-11-25 10:15:30.936 8740 WARNING keystone.common.wsgi [-] Could not find credential: 42d281acfab9f32048bfb83281de2f79e8b1709b1c73376cdb6f2b7f403d91bd
2015-11-25 10:15:30.937 8740 INFO eventlet.wsgi.server [-] 192.0.2.1 - - [25/Nov/2015 10:15:30] "POST /v3/ec2tokens HTTP/1.1" 404 378 0.002943


Could not find credential: 42d281acfab9f32048bfb83281de2f79e8b1709b1c73376cdb6f2b7f403d91bd

I've not been able to reproduce this locally yet. Lets leave it open and see if it happens to anyone else.
Comment 9 Zane Bitter 2015-12-03 16:32:18 EST
I wouldn't be suprised if this turned out to also be the mystery trigger for bug 1286118 (in which the user gets removed from Keystone for unknown reasons, and subsequently Heat was refusing to delete the resource because of the Not Found error).
Comment 10 Steve Baker 2015-12-04 16:01:29 EST
I can trigger this repeatedly locally now by forcing all deployment and deployment group resources to replace on update, so I'll be diagnosing the root cause next week.

Any stack stuck in this state can now be recovered by doing the following.

- Upgrade the undercloud heat packages to 2015.1.2-4 or later
- Restart openstack-heat-engine and wait for the overcloud stack to be put into an UPDATE_FAILED state (can someone please confirm whether the package upgrade triggers the restart and puts the overcloud into UPDATE_FAILED)
- Repeat the original openstack overcloud command which was stalled by the 403 issue.
Comment 11 Steve Baker 2015-12-07 16:20:19 EST
My 403s were caused by stale data in /var/lib/os-collect-config as I was switching from one software_config_transport to another. I'll be fixing that upstream with https://bugs.launchpad.net/heat/+bug/1523676

I've come to the conclusion that the 403s themselves are harmless, however 99-refresh-config exits as soon as a signal attempt fails[1] which can leave any remaining signals to heat unsent. This would explain some deployment resources remaining IN_PROGRESS when 403s are seen in the os-collect-config log.

Complicating the fix is that 99-refresh-completed gets to /usr/libexec/os-apply-config during image building, so no package update will propagate the fix.

[1] http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#n36
Comment 12 James Slagle 2015-12-11 15:30:46 EST
(In reply to Steve Baker from comment #11)
> My 403s were caused by stale data in /var/lib/os-collect-config as I was
> switching from one software_config_transport to another. I'll be fixing that
> upstream with https://bugs.launchpad.net/heat/+bug/1523676
> 
> I've come to the conclusion that the 403s themselves are harmless, however
> 99-refresh-config exits as soon as a signal attempt fails[1] which can leave
> any remaining signals to heat unsent. This would explain some deployment
> resources remaining IN_PROGRESS when 403s are seen in the os-collect-config
> log.
> 
> Complicating the fix is that 99-refresh-completed gets to
> /usr/libexec/os-apply-config during image building, so no package update
> will propagate the fix.
> 
> [1]
> http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/
> os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#n36

Is 403 an acceptable response to consider as Ok in all circumstances? E.g., can we just modify line 35 in 99-refresh-config to accept 200 and 403?

I'm perplexed why such a solution is acceptable. If it doesn't matter that these signals are accepted by Heat, why do we need to send them at all? Or is that Heat registers the signal, does whatever it needs to do, then hits the 403 after that and returns that to the caller?
Comment 13 James Slagle 2015-12-11 17:17:35 EST
i'm also not sure what you're saying the full fix is here. Is it to just modify 99-refresh-completed?

Or should comment 10 also fix it? I have an environment that is hitting this now, and I have that latest version of Heat, and doing that workaround just causes it to keep getting stuck each time I retry. So I also need the 99-refresh-completed fix?
Comment 14 Steve Baker 2015-12-11 19:30:27 EST
I think the full fix is for 99-refresh-completed to not fail when it receives a 403, and a workaround is comment 10.

The 403s happen because the deployment resource (and its associated user) has been deleted. This will generally be because a heat stack-update resulted in a deployment resource being replaced (so the old deployment signal will 403, but the new one won't).

The 403s will be transient, as the node continues to poll for data the old 403-causing deployments will disappear.

Once 99-refresh-completed is fixed, the 403s will be mainly an annoyance, but I'll be submitting some fixes to upstream heat to minimise the need for deployment resources to be replaced ever.
Comment 17 Mike Burns 2016-04-07 16:57:01 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 19 Jaromir Coufal 2016-10-10 00:10:31 EDT
Clean up - this seems obsolete, please, re-open if still valid.
Comment 20 Steve Baker 2017-04-10 18:02:05 EDT
I'm going to reopen this as still potentially affecting OSP-10, but not OSP-11 onwards because os-apply-config is now performed via a hook, so signalling happens via 55-heat-config instead of 99-refresh-completed.

Comment #14 is still a valid fix for this issue, 99-refresh-completed should tolerate 200 and 403 when signalling.

Note You need to log in before you can comment on or make changes to this bug.