Description of problem: Scaling out an Openstack Infrastructure provider with an additional compute node for an overcloud updated from 7.1 to 7.2 doesn't work. Version-Release number of selected component (if applicable): 5.5.0.13 How reproducible: 100% Steps to Reproduce: 1. Deploy RHEL OSP 7.1 (3 ctrls, 1 compute and 3 ceph nodes in my test) 2. Update both undercloud and overcloud to version 7.2 by following the update procedure 3. Add the undercloud to Cloudforms as Openstack Infra provider 4. Scale out the infra provider with an additional compute node Actual results: The stack update seems to get stuck. Expected results: The stack update finishes. Additional info: From Openstack side: Initial deploy command: openstack overcloud deploy \ --templates ~/templates/my-overcloud \ --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \ --ntp-server clock.redhat.com \ --libvirt-type qemu \ -e ~/templates/my-overcloud/environments/network-isolation.yaml \ -e ~/templates/network-environment.yaml \ -e ~/templates/firstboot-environment.yaml \ -e ~/templates/ceph.yaml Update command: /usr/bin/yes '' | openstack overcloud update stack overcloud -i \ --templates ~/templates/my-overcloud \ -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \ -e ~/templates/my-overcloud/environments/network-isolation.yaml \ -e ~/templates/network-environment.yaml \ -e ~/templates/firstboot-environment.yaml \ -e ~/templates/ceph.yaml \ -e ~/templates/my-overcloud/environments/updates/update-from-vip.yaml \ -e ~/templates/ctrlport.yaml
this is the call in ManageIQ https://bugzilla.redhat.com/show_bug.cgi?id=1288220 We do call PATCH with master template, we get from api and only one Count parameter, used for scaling.
Created attachment 1103221 [details] os-collect-config Moving this to the Openstack side. I tested the scale out on a non-update 7.2 deployment and it worked. When I try to scale out on an updated deployment I can see the new compute node provisioned and yum update triggered for it. In the end the overcloud stack update timeouts after 240minutes. Attaching the os-collect-config log from the added compute node.
From what I can tell from the os-collect-config log the last action on the newly added compute was: ++ curl -s -w '%{http_code}' -X POST -H Content-Type: -o /tmp/tmp.UOjYLGlR5o --data-binary '{"deploy_stdout": "os-apply-config deployment 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 completed", "deploy_status_code": "0"}' 'http://192.0.2.1:8000/v 1/signal/arn%3Aopenstack%3Aheat%3A%3A9edce9ad17474a2093b89749ea8ae99c%3Astacks%2Fovercloud-Compute-vwqrtha3heky-1-727wpn3ted3l%2F024b0343-e168-4ab8-be40-464a96187655%2Fresources%2FNovaComputeDeployment?Timestamp=2015-12-07T12%3A44%3A15Z&Si gnatureMethod=HmacSHA256&AWSAccessKeyId=5e5bf7331dcc4bfe8a2603ff8b8efb2e&SignatureVersion=2&Signature=FfuLQZUy%2FjUABdLZUQGfEHEOiMJOYcUXj9bH3OJEA4w%3D' + status=200 while the undercloud heat-api.log shows { "code": 404, "error": { "message": "Not found", "traceback": "Traceback (most recent call last):\n\n File \"/usr/lib/python2.7/site-packages/heat/common/context.py\", line 300, in wrapped\n return func(self, ctx, *args, **kwargs)\n\n File \"/usr/lib/python2.7/site-packages/heat/engine/service.py\", line 1504, in show_software_deployment\n cnxt, deployment_id)\n\n File \"/usr/lib/python2.7/site-packages/heat/engine/service_software_config.py\", line 148, in show_software_deployment\n cnxt, deployment_id)\n\n File \"/usr/lib/python2.7/site-packages/heat/objects/software_deployment.py\", line 72, in get_by_id\n db_api.software_deployment_get(context, deployment_id))\n\n File \"/usr/lib/python2.7/site-packages/heat/db/api.py\", line 292, in software_deployment_get\n return IMPL.software_deployment_get(context, deployment_id)\n\n File \"/usr/lib/python2.7/site-packages/heat/db/sqlalchemy/api.py\", line 820, in software_deployment_get\n deployment_id)\n\nNotFound: Deployment with id 0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 not found\n", "type": "NotFound" }, "explanation": "The resource could not be found.", "title": "Not Found" } 2015-12-07 11:33:14.715 1241 INFO eventlet.wsgi.server [req-e57b9aa6-d5c2-4892-9790-f203e3b5d9f5 fb1641bfac4d45c2a9af85bf39c78cb3 9edce9ad17474a2093b89749ea8ae99c] 192.0.2.1 - - [07/Dec/2015 11:33:14] "GET /v1/9edce9ad17474a2093b89749ea8ae99c/software_deployments/0dcd5cb4-3eb7-41ae-88ac-66decc9bfd75 HTTP/1.1" 404 1382 0.165222
This is believed to be resolved by the fix posted on https://bugzilla.redhat.com/show_bug.cgi?id=1290796 Can it be re-tested, with that fix in place?
I re-tested with the latest build but I am getting the same results. The stack update hasn't failed yet but I am expecting it to time out after 240 minutes. I'm going to leave the environment available for debugging. Thank you.
I am hitting the same issue, scale out is impossible after update from 7.1 to 7.2. There is another bug related to this: https://bugzilla.redhat.com/show_bug.cgi?id=1257950.
Got the same result with the last scale out attempt: | stack_status | UPDATE_FAILED | | stack_status_reason | Timed out
Non-working scale-out is a regression, requesting a blocker flag.
in the os-collect-config log, it's clear the compute node is still running yum update on scale out, which is not intended and very likely the reason for the timeout. this indicates that cloudforms is not using the same fix as illustrated in https://bugzilla.redhat.com/show_bug.cgi?id=1290796 on scale out atttempts.
@James not sure what fix there should be. For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount parameter and sending master template it downloads from the API, nothing else. I think that should always succeed, shouldn't it?
(In reply to Ladislav Smola from comment #13) > @James not sure what fix there should be. > > For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount > parameter and sending master template it downloads from the API, nothing > else. I think that should always succeed, shouldn't it? have a look at the way the client is passing UpdateIdentifier in clear_parameters on stack update to keep nodes from yum updating on scale out
(In reply to James Slagle from comment #14) > (In reply to Ladislav Smola from comment #13) > > @James not sure what fix there should be. > > > > For scale out, CloudForms is doing Heat PATCH, changing only ComputeCount > > parameter and sending master template it downloads from the API, nothing > > else. I think that should always succeed, shouldn't it? > > have a look at the way the client is passing UpdateIdentifier in > clear_parameters on stack update to keep nodes from yum updating on scale out I can't find a reference to UpdateIdentifier upstream or in any of the pending patches; do you mean DeployIdentifier in _update_parameters? It seems to match what you're talking about, but I just want to be sure.
Created attachment 1129928 [details] os-collect-config log
I updated the CloudForms PATCH call to use clear_parameters with DeployIdentifier and UpdateIdentifier. The resulting stack scale-up successfully bypassed the yum update; however, the update seems to hang. I've attached a log of os-collect-config; are there other items that might help us understand what's going on?
Created attachment 1130236 [details] heat update failures
Some additional info: it turns out that the scale out works if you run it a *second* time. It looks like this is because there may be Heat hooks leftover after the overcloud update that are not cleared. The second scale out attempt works because it happens after the user deletes the Heat resource after the timeout, clearing the hooks. If the hooks are leftover from after the update, then the cleanest solution would be to make sure that the update cleans up the hooks.
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.