Bug 1559889

Summary: [OSP13] HostPrepDeployment not always pushed on all nodes
Product: Red Hat OpenStack Reporter: Michele Baldessari <michele>
Component: openstack-heat Assignee: Zane Bitter <zbitter>
Status: CLOSED ERRATA QA Contact: Ronnie Rasouli <rrasouli>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens) CC: aschultz, dpeacock, jschluet, mburns, mcornea, michele, ohochman, rhel-osp-director-maint, sbaker, shardy, srevivo, therve
Target Milestone: beta Keywords: Triaged
Target Release: 13.0 (Queens)   
Hardware: All   
OS: Linux   
Fixed In Version: openstack-heat-10.0.1-0.20180314232330.c2a66b1.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:48:15 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Michele Baldessari 2018-03-23 13:54:56 UTC
Description of problem:
I have seen this problem only twice so far. The symptom is a hanging deployment; in my case the hanging resource was HostPrepDeployment.
We see that the deployment is stuck on some Host* resource that was scheduled on a node, but if you log on to that node, you'll see nothing happening.

For example, we can see the hanging resources:
| AllNodesDeploySteps                           | b58bee24-76d5-4a53-8d48-3e76b91407b7                                                                                                                                                 | OS::TripleO::PostDeploySteps                                                                                                     | CREATE_IN_PROGRESS | 2018-03-23T12:00:52Z | overcloud                                                                                                                                                |
| DatabaseHostPrepDeployment                    | 92469eaf-98be-4fbd-9d98-4101659cceb6                                                                                                                                                 | OS::Heat::SoftwareDeploymentGroup                                                                                                | CREATE_IN_PROGRESS | 2018-03-23T12:16:50Z | overcloud-AllNodesDeploySteps-sapadflv4r6r                                                                                                               |
| 2                                             | af13b8fb-5926-4167-b5f7-16354864dca6                                                                                                                                                 | OS::Heat::SoftwareDeployment                                                                                                     | CREATE_IN_PROGRESS | 2018-03-23T12:16:54Z | overcloud-AllNodesDeploySteps-sapadflv4r6r-DatabaseHostPrepDeployment-kd4qwvkin4zk

Now, if we hop onto all the nodes to check whether os-collect-config actually received this resource to be run on the node, we get the following:
$ for i in $(nova list |grep ctlplane | awk '{ print $12 }' | cut -f2 -d=); do echo $i; ssh $i "sudo os-collect-config --print |grep -e 'name.*HostPrepDepl'"; done 2> /dev/null
    "name": "CephStorageHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "ControllerHostPrepDeployment",
    "name": "DatabaseHostPrepDeployment",
    "name": "DatabaseHostPrepDeployment", <-- BOOOM overcloud-database-2 has not gotten the memo
    "name": "MessagingHostPrepDeployment",
    "name": "MessagingHostPrepDeployment",
    "name": "MessagingHostPrepDeployment",
    "name": "ComputeInstanceHAHostPrepDeployment", 

So we're basically in the following state:
A) Heat on the undercloud waits for DatabaseHostPrepDeployment to be completed on overcloud-database-2
B) overcloud-database-2 simply has no idea it needs to run that task

A redeploy will usually not show this (rather rare) problem.
Undercloud sosreport http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/sosreport-undercloud-0.redhat.local-20180323085448.tar.xz
DB dump: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/db.dump.gz
All other sosreports here:

Version-Release number of selected component (if applicable):
$ rpm -qa |grep heat

Comment 1 Michele Baldessari 2018-03-23 14:48:04 UTC
Chatted with therve, this is likely https://bugs.launchpad.net/tripleo/+bug/1731032

Comment 2 Michele Baldessari 2018-03-23 15:01:22 UTC
(In reply to Michele Baldessari from comment #1)
> Chatted with therve, this is likely
> https://bugs.launchpad.net/tripleo/+bug/1731032

It seems we already have the patches from the LP bug in this package (i.e. https://review.openstack.org/#/c/521468/), so this is likely something else.

Comment 3 Thomas Hervé 2018-03-23 15:34:52 UTC
I looked at it a little bit, and it really looks like https://bugs.launchpad.net/tripleo/+bug/1731032, which ought to be fixed. The database is correct and contains the deployment metadata, but the node doesn't get it. I don't have the full Heat logs; those would be useful. Having the Swift logs would help too.

Comment 4 Michele Baldessari 2018-03-23 15:44:05 UTC
I managed to take /var/log from the undercloud before the whole system got reprovisioned: http://file.rdu.redhat.com/~mbaldess/bz-osp13-heat-issue/var-log.tgz maybe it helps a bit?

Comment 5 Thomas Hervé 2018-03-23 15:46:43 UTC
Ah, I found the Swift logs in the journal; this is definitely the same issue:

Mar 23 08:17:06 undercloud-0 proxy-server: 23/Mar/2018/12/17/06 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 159751 - - txd4639a87568f40ffa6f23-005ab4f042 - 0.0108 - - 1521807426.890296936 1521807426.901119947 0

Mar 23 08:17:07 undercloud-0 proxy-server: 23/Mar/2018/12/17/07 PUT /v1/AUTH_f2cc1dec6db6422d82a5a8ef5556af52/ov--tw7r7g2htknr-2-s2w5wltduwsp-Database-xcon3myy6gbe/513ce19d-3a32-4ced-854e-ec31eda20744%3Ftemp_url_expires%3D2147483586%26temp_url_sig%3D394269aa972c830c59b731546d34e94d15d49738 HTTP/1.0 201 - python-requests/2.14.2 - 139580 - - tx8af2d2746ec54f86a2809-005ab4f043 - 0.0190 - - 1521807427.385188103 1521807427.404186010 0

We see two PUTs of the server metadata, and the second is smaller than the first (139580 bytes vs 159751).

It looks like https://review.openstack.org/#/c/521468/ didn't work.
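Those two overlapping PUTs are a classic lost-update race: each writer GETs the full metadata object, adds its own deployment locally, and PUTs the whole object back unconditionally, so whichever write lands second silently erases the other writer's deployment. A minimal sketch of that interleaving (hypothetical toy store, not Heat's actual code; the deployment names are just illustrative):

```python
import json

# Toy metadata store: a plain PUT replaces the whole object, last write
# wins -- there is no compare-and-swap, mirroring Swift's PUT behavior.
store = {"deployments": ["ExistingDeployment"]}

def get():
    """Simulate GET: return a deep copy of the full metadata object."""
    return json.loads(json.dumps(store))

def put(obj):
    """Simulate PUT: unconditionally replace the whole object."""
    store.clear()
    store.update(obj)

# Two writers interleave: both read BEFORE either one writes.
a = get()
b = get()
a["deployments"].append("DatabaseHostPrepDeployment")   # writer 1's addition
b["deployments"].append("MessagingHostPrepDeployment")  # writer 2's addition

put(a)  # first PUT (the larger object in the log above)
put(b)  # second PUT overwrites it with stale, smaller data

# DatabaseHostPrepDeployment has been lost: the node never sees it.
print(store["deployments"])  # ['ExistingDeployment', 'MessagingHostPrepDeployment']
```

This matches the symptom exactly: the database says the deployment exists, but the metadata object the node polls no longer contains it.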

Comment 6 Zane Bitter 2018-03-23 16:15:50 UTC
Looking at the code for the Swift object server, it appears that on PUT Swift supports If-None-Match but not If-Match:


This is also consistent with the API docs (although I don't put too much weight on the docs): https://developer.openstack.org/api-ref/object-store/

So it's likely that https://review.openstack.org/#/c/521468/ didn't achieve anything. We could resurrect https://review.openstack.org/#/c/521204/ to be sure.
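To illustrate the distinction: If-None-Match: * makes a PUT create-only, while an honored If-Match would give true compare-and-swap on the ETag, which is what a read-modify-write of shared metadata needs. A toy object server honoring both headers per standard HTTP conditional semantics (a sketch of what the fix relies on, not Swift's actual code):

```python
import hashlib

class ToyObjectServer:
    """Toy store honoring both conditional-PUT headers.

    Per the comment above, Swift's object server honors If-None-Match
    on PUT but ignores If-Match; this shows what honoring both buys.
    """

    def __init__(self):
        self.objects = {}  # name -> body (bytes)

    def etag(self, name):
        body = self.objects.get(name)
        return hashlib.md5(body).hexdigest() if body is not None else None

    def put(self, name, body, if_match=None, if_none_match=None):
        current = self.etag(name)
        if if_none_match == "*" and current is not None:
            return 412  # object already exists: create-only PUT refused
        if if_match is not None and if_match != current:
            return 412  # ETag changed since we read: lost update prevented
        self.objects[name] = body
        return 201

server = ToyObjectServer()
assert server.put("meta", b"v1") == 201
tag = server.etag("meta")          # read, remembering the ETag
server.put("meta", b"v2")          # a concurrent writer updates the object
# Our stale write is rejected instead of clobbering v2:
assert server.put("meta", b"stale", if_match=tag) == 412
```

Since real Swift ignores If-Match on PUT, the 412 in the last step never happens there, and the stale write wins, which is the overwrite seen in the proxy-server log.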

Comment 10 Omri Hochman 2018-04-03 16:35:38 UTC
The issue blocks overcloud deployment. It does not happen every time, but it was reproduced downstream on the latest OSP13 puddle, 2018-03-29.1.
Raising Severity, Adding Blocker(?) flag.

Comment 13 Omri Hochman 2018-04-16 19:01:17 UTC
Unable to reproduce with:

Comment 15 errata-xmlrpc 2018-06-27 13:48:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.