Created attachment 1098934 [details]
capped undercloud heat and keystone logs

Description of problem:

I see an issue in heat-config during an update: heat-config-notify gets a 403 (Forbidden) response - could this be a credentials issue somehow? The log looks like this (more is attached - the sosreport from controller0 has the full log of the following):

Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [DEBUG] Error: unable to push cib
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: Call cib_replace failed (-205): Update was older than existing configuration
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,353] (heat-config) [INFO] Completed /var/lib/heat-config/heat-config-script/49906ac4-5f22-427d-9ba3-3c0e958718e2
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,356] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,357] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/49906ac4-5f22-427d-9ba3-3c0e958718e2.json < /var/lib/heat-config/deployed/49906ac4-5f22-427d-9ba3-3c0e958718e2.notify.json
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [INFO]
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [DEBUG] [2015-11-25 10:15:30,887] (heat-config-notify) [DEBUG] Signaling to http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A4102f61fe21344e99daa97add9a368e4%3Astacks%2Fovercloud-Controller-iko5lkuvmmnc-0-halix4udz2xo%2F30065722-128e-49cb-9b68-2e2de7a98f58%2Fresources%2FUpdateDeployment?Timestamp=2015-11-18T07%3A13%3A55Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=d88efa476783480895e59014f475b2d8&SignatureVersion=2&Signature=Z6OEDtL9tfT155TnIIelHuRukE67OJ7rFHJmrWf%2B9uo%3D via POST
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,915] (heat-config-notify) [DEBUG] Response <Response [403]>
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [WARNING] Skipping group os-apply-config with no hook script /var/lib/heat-config/hooks/os-apply-config
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,936] (heat-config) [WARNING] Skipping group Heat::Ungrouped with no hook script /var/lib/heat-config/hooks/HeatUngrouped
Nov 25 10:15:30 overcloud-controller-0 os-collect-config: [2015-11-25 10:15:30,937] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/puppet < /var/lib/heat-config/deployed/dabdc653-886f-427d-b21e-f2d915065b8b.json
Nov 25 10:15:31 overcloud-controller-0 crmd[2354]: notice: Operation mongod_start_0: ok (node=overcloud-controller-0, call=190, rc=0, cib-update=57, confirmed=true)
Nov 25 10:15:31 overcloud-controller-0 redis(redis)[4584]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'

The context here is that this happens during an update. Starting with a running 7.1 overcloud, deployed like:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1 --libvirt-type qemu --ntp-server "0.fedora.pool.ntp.org" -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

yum update the undercloud, then start the overcloud update like:

openstack overcloud update stack overcloud -i --templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e update.yaml

Control1 is updated successfully, then compute0 is also updated. Control0 starts to get updated... but then control0 seems to hang - bug 1285485 may be the root cause of that, but I don't think that 403 should be there either? More context about the exact setup is in that other bug, if interested.
Here I attach logs for undercloud keystone and heat. I had to cap the length of the undercloud log files as they were massive; I have fuller/older logs if these are required.

thanks, marios
Created attachment 1098935 [details] control0 sosreport
I wonder if this could also be related to bug 1278181. Since /var/run/heat-config was gone and we're rerunning all the already-applied deployments, perhaps heat-config-notify is trying to signal a URL that is no longer valid.
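To make the stale-URL theory concrete, here is a small stdlib-only sketch that decodes the signed URL from the log in comment 0 (the URL is copied verbatim from the log; nothing else is assumed). Note that the Timestamp query parameter is 2015-11-18, a week before the 403s were seen, which fits the idea that an old, already-completed deployment is being re-signalled.

```python
# Decode the signed signal URL from the os-collect-config log to see
# which deployment resource and which EC2 access key it refers to.
from urllib.parse import parse_qs, unquote, urlsplit

signal_url = (
    "http://192.0.2.1:8000/v1/signal/"
    "arn%3Aopenstack%3Aheat%3A%3A4102f61fe21344e99daa97add9a368e4"
    "%3Astacks%2Fovercloud-Controller-iko5lkuvmmnc-0-halix4udz2xo"
    "%2F30065722-128e-49cb-9b68-2e2de7a98f58%2Fresources%2FUpdateDeployment"
    "?Timestamp=2015-11-18T07%3A13%3A55Z&SignatureMethod=HmacSHA256"
    "&AWSAccessKeyId=d88efa476783480895e59014f475b2d8&SignatureVersion=2"
    "&Signature=Z6OEDtL9tfT155TnIIelHuRukE67OJ7rFHJmrWf%2B9uo%3D"
)

parts = urlsplit(signal_url)
path = unquote(parts.path)     # the ARN names one specific resource instance
query = parse_qs(parts.query)  # the access key belongs to that deployment

# If that UpdateDeployment resource (and its ec2 keypair) has since been
# replaced or deleted, both the ARN and the AWSAccessKeyId are stale -> 403.
```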
Steve, could you have a look at this one? I think Marios has attached all the relevant logs.
FYI, it has been reported by others that they updated from 7.1 to 7.2 without issue... so this may be an environment issue.
I don't think this is related to bug 1278181, or to rebooting the node. This is likely an environment issue. I found the relevant entries in heat-api-cfn.log; the tl;dr is that keystone returned the 403. I'll keep looking at the log to see whether that was a transient error.

2015-11-25 10:15:22.096 29174 INFO eventlet.wsgi.server [req-f482fe78-2a90-4209-b870-1f5d1ced704f 8bab563883474bdbbe269624dd0ab538 a27fb86e7cae453aa8e18e8d1fd40ccc] 192.0.2.15 - - [25/Nov/2015 10:15:22] "GET /v1/?SignatureVersion=2&AWSAccessKeyId=926d1433b89443b18d3f9b183f069ffc&StackName=overcloud-Controller-iko5lkuvmmnc-0-halix4udz2xo&SignatureMethod=HmacSHA256&Signature=BuyUHmNBAULUPzeRhcSdxwC3U69ESfznaUpqiR2hOTM%3D&Action=DescribeStackResource&LogicalResourceId=Controller HTTP/1.1" 200 522800 1.577003
2015-11-25 10:15:30.916 29174 INFO eventlet.wsgi.server [-] (29174) accepted ('192.0.2.15', 49828)
2015-11-25 10:15:30.917 29174 DEBUG heat.api.middleware.version_negotiation [-] Processing request: POST /v1/signal/arn:openstack:heat::4102f61fe21344e99daa97add9a368e4:stacks/overcloud-Controller-iko5lkuvmmnc-0-halix4udz2xo/30065722-128e-49cb-9b68-2e2de7a98f58/resources/UpdateDeployment Accept: */* process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:50
2015-11-25 10:15:30.917 29174 DEBUG heat.api.middleware.version_negotiation [-] Matched versioned URI. Version: 1.0 process_request /usr/lib/python2.7/site-packages/heat/api/middleware/version_negotiation.py:65
2015-11-25 10:15:30.917 29174 INFO heat.api.aws.ec2token [-] Checking AWS credentials..
2015-11-25 10:15:30.918 29174 INFO heat.api.aws.ec2token [-] AWS credentials found, checking against keystone.
2015-11-25 10:15:30.932 29174 INFO heat.api.aws.ec2token [-] Authenticating with http://192.0.2.1:5000/v3/ec2tokens
2015-11-25 10:15:30.937 29174 INFO heat.api.aws.ec2token [-] AWS authentication failure.
2015-11-25 10:15:30.938 29174 INFO eventlet.wsgi.server [-] 192.0.2.15 - - [25/Nov/2015 10:15:30] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3A4102f61fe21344e99daa97add9a368e4%3Astacks%2Fovercloud-Controller-iko5lkuvmmnc-0-halix4udz2xo%2F30065722-128e-49cb-9b68-2e2de7a98f58%2Fresources%2FUpdateDeployment?Timestamp=2015-11-18T07%3A13%3A55Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=d88efa476783480895e59014f475b2d8&SignatureVersion=2&Signature=Z6OEDtL9tfT155TnIIelHuRukE67OJ7rFHJmrWf%2B9uo%3D HTTP/1.1" 403 306 0.020683
And the keystone log:

2015-11-25 10:15:30.934 8740 DEBUG keystone.middleware.core [-] Auth token not in the request header. Will not build auth context. process_request /usr/lib/python2.7/site-packages/keystone/middleware/core.py:223
2015-11-25 10:15:30.934 8740 INFO keystone.common.wsgi [-] POST http://192.0.2.1:5000/v3/ec2tokens
2015-11-25 10:15:30.936 8740 WARNING keystone.common.wsgi [-] Could not find credential: 42d281acfab9f32048bfb83281de2f79e8b1709b1c73376cdb6f2b7f403d91bd
2015-11-25 10:15:30.937 8740 INFO eventlet.wsgi.server [-] 192.0.2.1 - - [25/Nov/2015 10:15:30] "POST /v3/ec2tokens HTTP/1.1" 404 378 0.002943

So keystone rejected the request because it could not find credential 42d281acfab9f32048bfb83281de2f79e8b1709b1c73376cdb6f2b7f403d91bd.

I've not been able to reproduce this locally yet. Let's leave it open and see if it happens to anyone else.
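For what it's worth, the "Could not find credential: <64 hex chars>" message is consistent with how I understand keystone's EC2 extension to look up credentials: the credential id is derived by SHA-256-hashing the access key, so a deleted keypair simply misses the lookup. A hedged sketch (the hashing detail is my assumption about keystone's behaviour, not something confirmed by this log alone):

```python
import hashlib

# Assumption: keystone's EC2 extension stores credentials keyed by the
# SHA-256 hex digest of the access key, so validating a token for a
# deleted keypair produces a "Could not find credential" miss.
def credential_id_for(access_key):
    return hashlib.sha256(access_key.encode()).hexdigest()

# Access key taken from the failing signal in comment 0:
lookup_id = credential_id_for("d88efa476783480895e59014f475b2d8")
# If the ec2 keypair backing the deployment was deleted, this id is
# absent from keystone's credential table -> keystone 404s the
# ec2tokens call -> heat-api-cfn turns that into the 403 we see.
```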
I wouldn't be surprised if this turned out to also be the mystery trigger for bug 1286118 (in which the user gets removed from Keystone for unknown reasons, and Heat subsequently refuses to delete the resource because of the Not Found error).
I can trigger this repeatedly locally now by forcing all deployment and deployment group resources to be replaced on update, so I'll be diagnosing the root cause next week.

Any stack stuck in this state can now be recovered by doing the following:
- Upgrade the undercloud heat packages to 2015.1.2-4 or later
- Restart openstack-heat-engine and wait for the overcloud stack to be put into an UPDATE_FAILED state (can someone please confirm whether the package upgrade triggers the restart and puts the overcloud into UPDATE_FAILED)
- Repeat the original openstack overcloud command which was stalled by the 403 issue.
My 403s were caused by stale data in /var/lib/os-collect-config as I was switching from one software_config_transport to another. I'll be fixing that upstream with https://bugs.launchpad.net/heat/+bug/1523676

I've come to the conclusion that the 403s themselves are harmless; however, 99-refresh-completed exits as soon as a signal attempt fails [1], which can leave any remaining signals to heat unsent. This would explain some deployment resources remaining IN_PROGRESS when 403s are seen in the os-collect-config log.

Complicating the fix is that 99-refresh-completed gets copied into /usr/libexec/os-apply-config during image building, so no package update will propagate the fix.

[1] http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#n36
(In reply to Steve Baker from comment #11)
> My 403s were caused by stale data in /var/lib/os-collect-config as I was
> switching from one software_config_transport to another. I'll be fixing that
> upstream with https://bugs.launchpad.net/heat/+bug/1523676
>
> I've come to the conclusion that the 403s themselves are harmless, however
> 99-refresh-config exits as soon as a signal attempt fails[1] which can leave
> any remaining signals to heat unsent. This would explain some deployment
> resources remaining IN_PROGRESS when 403s are seen in the os-collect-config
> log.
>
> Complicating the fix is that 99-refresh-completed gets to
> /usr/libexec/os-apply-config during image building, so no package update
> will propagate the fix.
>
> [1]
> http://git.openstack.org/cgit/openstack/tripleo-image-elements/tree/elements/
> os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#n36

Is a 403 an acceptable response to treat as OK in all circumstances? E.g., could we just modify line 35 in 99-refresh-config to accept both 200 and 403? I'm perplexed as to why such a solution would be acceptable: if it doesn't matter whether these signals are accepted by Heat, why do we need to send them at all? Or is it that Heat registers the signal, does whatever it needs to do, then hits the 403 after that and returns it to the caller?
I'm also not sure what you're saying the full fix is here. Is it just to modify 99-refresh-completed? Or should comment 10 also fix it? I have an environment that is hitting this now with the latest version of Heat, and doing that workaround just causes it to keep getting stuck each time I retry. So do I also need the 99-refresh-completed fix?
I think the full fix is for 99-refresh-completed to not fail when it receives a 403, and a workaround is comment 10.

The 403s happen because the deployment resource (and its associated user) has been deleted. This will generally be because a heat stack-update resulted in a deployment resource being replaced (so the old deployment signal will 403, but the new one won't). The 403s will be transient: as the node continues to poll for data, the old 403-causing deployments will disappear.

Once 99-refresh-completed is fixed, the 403s will be mainly an annoyance, but I'll be submitting some fixes to upstream heat to minimise the need for deployment resources to be replaced ever.
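A minimal sketch of what a tolerant signaller would do, in illustrative Python rather than the element's actual shell (all names here are made up, not the real script's): treat 200 and 403 as "delivered" and keep iterating, so later signals still go out instead of being abandoned on the first 403.

```python
# Illustrative sketch of the proposed 99-refresh-completed fix: accept
# 200 (delivered) and 403 (deployment/user already deleted, harmless)
# and keep signalling, instead of exiting on the first failure.
OK_STATUSES = (200, 403)

def send_signals(signals, post):
    """POST each (url, body) signal; return the URLs that genuinely failed."""
    failed = []
    for url, body in signals:
        status = post(url, body)
        if status not in OK_STATUSES:
            failed.append(url)  # real error: record it, but keep going
    return failed

# Stubbed transport showing the behaviour: the 403 no longer blocks signal-3.
statuses = {"signal-1": 200, "signal-2": 403, "signal-3": 500}
failed = send_signals(
    [(u, None) for u in ("signal-1", "signal-2", "signal-3")],
    lambda url, body: statuses[url],
)
```

The design choice here is that a 403 is not silently equated with success in general; it is only tolerated because, per the analysis above, it specifically means the old deployment (and its user) no longer exists.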
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
Clean up - this seems obsolete; please re-open if still valid.
I'm going to reopen this as still potentially affecting OSP-10, but not OSP-11 onwards because os-apply-config is now performed via a hook, so signalling happens via 55-heat-config instead of 99-refresh-completed. Comment #14 is still a valid fix for this issue, 99-refresh-completed should tolerate 200 and 403 when signalling.
Are the heat-config-notify 403 errors still an issue for anyone?

I suspect that some of the 403 errors are the result of a mismatch between the data that was used to create the EC2 token (heat/engine/resources/signal_responder.py / _get_ec2_signed_url) and the data posted to keystone to verify the EC2 token (heat/api/aws/ec2token.py / _authorize).

Limiting the data sent to keystone to just the query string that _get_ec2_signed_url created resolves the 403 errors for me [1].

[1] https://bugs.launchpad.net/magnum/+bug/1720816/comments/10
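The mismatch theory is easy to demonstrate in isolation: with SignatureVersion 2, the HMAC is computed over a canonical string built from the exact query parameters, so if the verifier recomputes it over a superset of those parameters the signatures cannot match. A self-contained sketch (the secret, host, and path are made up; this mimics the generic SigV2 scheme, not Heat's actual code):

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def sigv2_signature(secret, verb, host, path, params):
    # SignatureVersion 2: HMAC-SHA256 over "VERB\nhost\npath\nsorted-qs"
    qs = "&".join(
        "%s=%s" % (quote(k, safe="-_.~"), quote(str(v), safe="-_.~"))
        for k, v in sorted(params.items())
    )
    string_to_sign = "\n".join((verb, host, path, qs))
    digest = hmac.new(secret.encode(), string_to_sign.encode(),
                      hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

secret = "not-a-real-secret"  # hypothetical
params = {
    "SignatureVersion": "2",
    "SignatureMethod": "HmacSHA256",
    "AWSAccessKeyId": "d88efa476783480895e59014f475b2d8",
    "Timestamp": "2015-11-18T07:13:55Z",
}
signed = sigv2_signature(secret, "POST", "192.0.2.1:8000",
                         "/v1/signal/example", params)

# If the verifying side recomputes over the query string *plus* anything
# extra the middleware slipped in, the HMAC differs -> keystone says the
# signature is invalid -> heat-api-cfn returns 403 to the signal.
verified = sigv2_signature(secret, "POST", "192.0.2.1:8000",
                           "/v1/signal/example", dict(params, Extra="x"))
```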