Bug 1264203
Summary: | rhel-osp-director: "openstack overcloud update stack --templates -e <yaml> -i overcloud" failed. | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha>
Component: | openstack-puppet-modules | Assignee: | Emilien Macchi <emacchi>
Status: | CLOSED ERRATA | QA Contact: | Alexander Chuzhoy <sasha>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | unspecified | CC: | dprince, emacchi, ichavero, jprovazn, jslagle, kbasil, mburns, mcornea, rhel-osp-director-maint, sasha, sbaker, yeylon, zbitter
Target Milestone: | z2 | Keywords: | Triaged, ZStream
Target Release: | 7.0 (Kilo) | Flags: | ichavero: needinfo-
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | openstack-puppet-modules-2015.1.8-20.el7ost | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2015-10-08 12:24:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1261921, 1266327 | |
Bug Blocks: | 1252509 | |
Description (Alexander Chuzhoy, 2015-09-17 21:01:18 UTC)
The heat-engine.log is too big to attach. Hopefully the below snap from the log helps:

    2015-09-17 16:50:50.367 11768 ERROR heat.engine.resources.openstack.heat.software_deployment [-] Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource Traceback (most recent call last):
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 508, in _action_recorder
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     yield
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 578, in _do_action
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 313, in wrapper
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     step = next(subtask)
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 552, in action_handler_task
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     while not check(handler_data):
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 471, in check_create_complete
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     return self._check_complete()
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/software_deployment.py", line 278, in _check_complete
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource     raise exc
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource Error: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
    2015-09-17 16:50:50.368 11768 TRACE heat.engine.resource

sasha, I don't think that part of the heat-engine log is going to help get to the root cause here. Can you upload the heat log somewhere and paste the link here, or make the environment available for debugging?

The error is on the server we're deploying to, not in Heat, so the only thing the Heat logs will help with is finding out which deployment actually failed (although that could just as easily, or more easily, be found using the API). We need to track down why the software deployment script failed and exited with error code 1.

I had a look at sasha's failing environment. Here's the failed deployment:

    [stack@undercloud ~]$ heat deployment-show 19da2391-6326-4776-8551-55f6831ef4d3
    {
      "status": "FAILED",
      "server_id": "f935c922-8a35-4ef8-98cc-4f56688ad4ab",
      "config_id": "08c03ab4-4c4f-4c12-a05a-eb8661eab9e1",
      "output_values": {
        "deploy_stdout": "",
        "deploy_stderr": "\u001b[1;31mError: Package upgrades require that enable_install be set to true at /etc/puppet/modules/tripleo/manifests/packages.pp:48 on node overcloud-controller-0.localdomain\u001b[0m\n\u001b[1;31mError: Package upgrades require that enable_install be set to true at /etc/puppet/modules/tripleo/manifests/packages.pp:48 on node overcloud-controller-0.localdomain\u001b[0m\n",
        "deploy_status_code": 1
      },
      "creation_time": "2015-09-23T16:56:34Z",
      "updated_time": "2015-09-23T19:20:40Z",
      "input_values": {},
      "action": "UPDATE",
      "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1",
      "id": "19da2391-6326-4776-8551-55f6831ef4d3"
    }

steve, does the above look familiar at all?

It looks like this is happening because overcloud-resource-registry-puppet.yaml has parameter_defaults: {EnablePackageInstall: false}. The tripleo::packages puppet class has logic which requires enable_install (param EnablePackageInstall) to be true if enable_upgrade is set to true.
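The guard that produces the deploy_stderr error above can be sketched in Python. This is a hypothetical model of the tripleo::packages check for illustration only, not the actual Puppet manifest; the function name is invented:

```python
# Hypothetical model of the guard in tripleo::packages (packages.pp).
# enable_install corresponds to the Heat parameter EnablePackageInstall.

def check_package_flags(enable_install, enable_upgrade):
    """Fail, as the Puppet class does, when package upgrades are
    requested while package installation is disabled."""
    if enable_upgrade and not enable_install:
        raise RuntimeError(
            "Package upgrades require that enable_install be set to true")
    return enable_install or enable_upgrade

# The overcloud update turns upgrades on, but the default environment
# pins EnablePackageInstall to false, so the check fails:
try:
    check_package_flags(enable_install=False, enable_upgrade=True)
except RuntimeError as exc:
    print(exc)  # prints the "Package upgrades require ..." message
```

This matches the failure mode seen in the deployment output: both flags come from different sources, and the defaulted one wins.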
I think the appropriate fix would be for the command "openstack overcloud update stack" to invoke the stack-update operation with an environment that includes parameter_defaults: {EnablePackageInstall: true}

Jan pointed out to me that that isn't a great solution, because the environment is now sticky, so it would become true on every subsequent stack update. Our options for preventing that would be:

1) Pass an environment with "parameter_defaults: {EnablePackageInstall: false}" on every command other than "openstack overcloud update".
2) Pass EnablePackageInstall as a parameter instead in "openstack overcloud update". In every other command, pass "--clear-parameter EnablePackageInstall" to go back to the default.

The second is slightly better than the first (which I'd say is just about a non-starter), but it's still unmaintainably error-prone. Another potential wrinkle is that I assume this parameter is passed to all of the Puppet SoftwareDeployments, and this constant change to an input could cause Heat to re-run them when it shouldn't. I think we should consider other options:

3) Get rid of the logic in tripleo::packages that is blocking on this. If somebody went to the trouble of setting enable_upgrade, why stop them by making them jump through extra hoops?
4) Set enable_install through the same mechanism that we're setting enable_upgrade instead of through the environment.

(In reply to Zane Bitter from comment #12)
> 3) Get rid of the logic in tripleo::packages that is blocking on this. If
> somebody went to the trouble of setting enable_upgrade why stop them by
> making them jump through extra hoops?

This would be my preferred option.
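The "sticky environment" problem behind options 1 and 2 can be illustrated with a toy merge. This is a hypothetical sketch, not Heat's actual environment-handling code: each stack-update merges the supplied environment on top of the stored one, so a value passed once persists into every later update:

```python
# Toy model (hypothetical, not Heat's implementation) of environment
# stickiness: each update merges its environment over the stored one,
# and the merged result becomes the stored environment for next time.

def stack_update(stored_env, new_env=None):
    merged = dict(stored_env)
    merged.update(new_env or {})
    return merged

env = {"EnablePackageInstall": False}                    # initial deploy
env = stack_update(env, {"EnablePackageInstall": True})  # overcloud update
env = stack_update(env)                                  # later unrelated update
print(env["EnablePackageInstall"])  # True: the one-off override stuck
```

This is why option 1 would require every other command to explicitly pass the value back to false, and why both workarounds were judged error-prone.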
It would mean changing the puppet logic to say enable_update implies enable_install, so:

http://git.openstack.org/cgit/openstack/puppet-tripleo/tree/manifests/packages.pp#n35

would become

    if !$enable_install and !$enable_update {

and the following if block would be removed:

http://git.openstack.org/cgit/openstack/puppet-tripleo/tree/manifests/packages.pp#n47

I'm assigning this to openstack-puppet-modules.

I concur that Steve Baker's solution is probably the best option we have for this.

I've posted a patch to simplify the use of tripleo::packages when enable_upgrade is set here: https://review.openstack.org/228532

Patch sent upstream by dprince: https://review.openstack.org/#/c/228532/

FailedQA
Environment: openstack-puppet-modules-2015.1.8-20.el7ost.noarch
comment #17
comment #18
comment #19
The timeout script is looking for the value of `hostname` in the output of pcs status, but hostname=overcloud-controller-1.localdomain and the pacemaker node name is overcloud-controller-1.

This is actually a problem with the fix for bug 1261921, so that one should go to FailedQA, not this one (it didn't actually get far enough to test the fix for this one). Setting back to on_qa. This bug depends on bug #1261921.

Verified:
Environment: openstack-puppet-modules-2015.1.8-21.el7ost.noarch
The update completes successfully now:

    ...
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    IN_PROGRESS
    COMPLETE
    update finished with status COMPLETE

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1872
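The FailedQA mismatch described above (FQDN hostname vs. short pacemaker node name) comes down to a substring check. A minimal illustration, with invented sample data rather than real `pcs status` output:

```python
# Hypothetical illustration of the FailedQA cause: the timeout script
# looks for the value of `hostname` (an FQDN) in `pcs status` output,
# but pacemaker reports short node names, so the FQDN never matches.

pcs_status = "Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]"
hostname = "overcloud-controller-1.localdomain"

print(hostname in pcs_status)                # False: FQDN is not found
print(hostname.split(".")[0] in pcs_status)  # True: the short name is
```

Matching on the short name (or on pacemaker's own notion of the node name) avoids the false timeout.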