Bug 1264203
Summary: | rhel-osp-director: "openstack overcloud update stack --templates -e <yaml> -i overcloud" failed. | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha>
Component: | openstack-puppet-modules | Assignee: | Emilien Macchi <emacchi>
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | unspecified | CC: | dprince, emacchi, ichavero, jprovazn, jslagle, kbasil, mburns, mcornea, rhel-osp-director-maint, sasha, sbaker, yeylon, zbitter
Target Milestone: | z2 | Keywords: | Triaged, ZStream
Target Release: | 7.0 (Kilo) | Flags: | ichavero: needinfo-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | openstack-puppet-modules-2015.1.8-20.el7ost | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-08 12:24:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1261921, 1266327 | |
Bug Blocks: | 1252509 | |
Description (Alexander Chuzhoy, 2015-09-17 21:01:18 UTC)
The heat-engine.log is too big to attach. Hopefully the below snap from the log helps:

    2015-09-17 16:50:50.367 11768 ERROR heat.engine.resources.openstack.heat.software_deployment [-] Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource Traceback (most recent call last):
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 508, in _action_recorder
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     yield
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 578, in _do_action
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 313, in wrapper
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     step = next(subtask)
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 552, in action_handler_task
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     while not check(handler_data):
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 471, in check_create_complete
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     return self._check_complete()
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 278, in _check_complete
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     raise exc
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource

sasha, I don't think that part of the heat-engine log is going to help get to the root cause here. Can you upload the heat log somewhere and paste the link here, or make the environment available for debugging?

The error is on the server we're deploying to, not in Heat, so the only thing the Heat logs will help with is finding out which deployment actually failed (although that could just as easily, or more easily, be found using the API). We need to track down why the software deployment script failed and exited with error code 1.

I had a look at sasha's failing environment. Here's the failed deployment:

    [stack@undercloud ~]$ heat deployment-show 19da2391-6326-4776-8551-55f6831ef4d3
    {
      "status": "FAILED",
      "server_id": "f935c922-8a35-4ef8-98cc-4f56688ad4ab",
      "config_id": "08c03ab4-4c4f-4c12-a05a-eb8661eab9e1",
      "output_values": {
        "deploy_stdout": "",
        "deploy_stderr": "\u001b[1;31mError: Package upgrades require that enable_install be set to true at /etc/puppet/modules/tripleo/manifests/packages.pp:48 on node overcloud-controller-0.localdomain\u001b[0m\n\u001b[1;31mError: Package upgrades require that enable_install be set to true at /etc/puppet/modules/tripleo/manifests/packages.pp:48 on node overcloud-controller-0.localdomain\u001b[0m\n",
        "deploy_status_code": 1
      },
      "creation_time": "2015-09-23T16:56:34Z",
      "updated_time": "2015-09-23T19:20:40Z",
      "input_values": {},
      "action": "UPDATE",
      "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1",
      "id": "19da2391-6326-4776-8551-55f6831ef4d3"
    }

steve, does the above look familiar at all?

It looks like this is happening because overcloud-resource-registry-puppet.yaml has parameter_defaults: {EnablePackageInstall: false}. The tripleo::packages puppet class has logic which requires enable_install (param EnablePackageInstall) to be true if enable_upgrade is set to true.
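The guard that produces the deploy_stderr error above can be sketched in Python. This is a hypothetical model of the tripleo::packages check for illustration only, not the actual Puppet manifest; the function name is invented:

```python
# Hypothetical model of the guard in tripleo::packages (packages.pp).
# enable_install corresponds to the Heat parameter EnablePackageInstall.

def check_package_flags(enable_install, enable_upgrade):
    """Fail, as the Puppet class does, when package upgrades are
    requested while package installation is disabled."""
    if enable_upgrade and not enable_install:
        raise RuntimeError(
            "Package upgrades require that enable_install be set to true")
    return enable_install or enable_upgrade

# The overcloud update turns upgrades on, but the default environment
# pins EnablePackageInstall to false, so the check fails:
try:
    check_package_flags(enable_install=False, enable_upgrade=True)
except RuntimeError as exc:
    print(exc)  # prints the "Package upgrades require ..." message
```

This matches the failure mode seen in the deployment output: both flags come from different sources, and the defaulted one wins.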
I think the appropriate fix would be for the command "openstack overcloud update stack" to invoke the stack-update operation with an environment that includes parameter_defaults: {EnablePackageInstall: true}

Jan pointed out to me that that isn't a great solution, because the environment is now sticky, so it would become true on every subsequent stack update. Our options for preventing that would be:

1) Pass an environment with "parameter_defaults: {EnablePackageInstall: false}" on every command other than "openstack overcloud update".
2) Pass EnablePackageInstall as a parameter instead in "openstack overcloud update". In every other command, pass "--clear-parameter EnablePackageInstall" to go back to the default.

The second is slightly better than the first (which I'd say is just about a non-starter), but it's still unmaintainably error-prone. Another potential wrinkle is that I assume this parameter is passed to all of the Puppet SoftwareDeployments, and this constant change to an input could cause Heat to re-run them when it shouldn't. I think we should consider other options:

3) Get rid of the logic in tripleo::packages that is blocking on this. If somebody went to the trouble of setting enable_upgrade, why stop them by making them jump through extra hoops?
4) Set enable_install through the same mechanism that we're setting enable_upgrade instead of through the environment.

(In reply to Zane Bitter from comment #12)
> 3) Get rid of the logic in tripleo::packages that is blocking on this. If
> somebody went to the trouble of setting enable_upgrade why stop them by
> making them jump through extra hoops?

This would be my preferred option.
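The "sticky environment" problem behind options 1 and 2 can be illustrated with a toy merge. This is a hypothetical sketch, not Heat's actual environment-handling code: each stack-update merges the supplied environment on top of the stored one, so a value passed once persists into every later update:

```python
# Toy model (hypothetical, not Heat's implementation) of environment
# stickiness: each update merges its environment over the stored one,
# and the merged result becomes the stored environment for next time.

def stack_update(stored_env, new_env=None):
    merged = dict(stored_env)
    merged.update(new_env or {})
    return merged

env = {"EnablePackageInstall": False}                    # initial deploy
env = stack_update(env, {"EnablePackageInstall": True})  # overcloud update
env = stack_update(env)                                  # later unrelated update
print(env["EnablePackageInstall"])  # True: the one-off override stuck
```

This is why option 1 would require every other command to explicitly pass the value back to false, and why both workarounds were judged error-prone.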
It would mean changing the puppet logic to say enable_update implies enable_install, so:

http://git.openstack.org/cgit/openstack/puppet-tripleo/tree/manifests/packages.pp#n35

would become

    if !$enable_install and !$enable_update {

and the following if block would be removed:

http://git.openstack.org/cgit/openstack/puppet-tripleo/tree/manifests/packages.pp#n47

I'm assigning this to openstack-puppet-modules.

I concur that Steve Baker's solution is probably the best option we have for this.

I've posted a patch to simplify the use of tripleo::packages when enable_upgrade is set here: https://review.openstack.org/228532

Patch sent upstream by dprince: https://review.openstack.org/#/c/228532/

FailedQA
Environment: openstack-puppet-modules-2015.1.8-20.el7ost.noarch
comment #17
comment #18
comment #19
The timeout script is looking for the value of `hostname` in the output of pcs status, but hostname=overcloud-controller-1.localdomain and the pacemaker node name is overcloud-controller-1.

This is actually a problem with the fix for bug 1261921, so that one should go to FailedQA, not this one (it didn't actually get far enough to test the fix for this one). Setting back to on_qa. This bug depends on bug #1261921.

Verified:
Environment: openstack-puppet-modules-2015.1.8-21.el7ost.noarch
The update completes successfully now:

    ...
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    COMPLETE
    update finished with status COMPLETE

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1872
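The FailedQA mismatch described above (FQDN hostname vs. short pacemaker node name) comes down to a substring check. A minimal illustration, with invented sample data rather than real `pcs status` output:

```python
# Hypothetical illustration of the FailedQA cause: the timeout script
# looks for the value of `hostname` (an FQDN) in `pcs status` output,
# but pacemaker reports short node names, so the FQDN never matches.

pcs_status = "Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]"
hostname = "overcloud-controller-1.localdomain"

print(hostname in pcs_status)                # False: FQDN is not found
print(hostname.split(".")[0] in pcs_status)  # True: the short name is
```

Matching on the short name (or on pacemaker's own notion of the node name) avoids the false timeout.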